arXiv:1504.07972vl [math.ST] 29 Apr 2015 


Adaptive Bayesian credible sets in regression with a 

Gaussian process prior 

Suzanne Sniekers* and Aad van der Vaart^^ 

Mathematical Institute 
Leiden University 
P.O. Box 9512 
2300 RA Leiden 
The Netherlands 

e-mail: suzanne . sniekersSmath. leidenuniv.nl 
e-mail: avdvaartOmath. leidenuniv. nl 
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1. Introduction and main result 

We consider the fixed design regression model, where we observe a vector Yn := 
■ ■ ■, with coordinates 

^,n — f (^i,n) A~ t G {1, . . . , Tl}. (f-f) 

Here the parameter / is an unknown function / : A —)• M on some set X, the design points 
(xi^n) are a known sequence of points in X, and the (unobservable) errors Si^n are independent 
standard normal random variables. We are interested in the performance of a nonparametric 
Bayesian approach that uses a scaled Gaussian process \fcW as a prior on /. We investigate its 
efficiency to reconstruct the true regression function, and its ability to quantify the remaining 
uncertainty in the statistical analysis through the full posterior distribution. Our main interest 
is in the dependence of the posterior distribution on the scaling factor y/c in the Gaussian 
process, which can be viewed as a bandwidth parameter that can adapt the prior and posterior 
distributions to the unknown regularity of the regression function. We consider empirical and 
hierarchical Bayes methods to determine this scaling factor, and study the properties of the 
resulting plug-in or full posterior distributions. 

We denote the prior process for / = (/(x) : x £ X) hy = (WJ : x G X), where c is the 
scaling factor, and it is assumed that the process is equal in distribution to the process 
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y/cW^. The index set T may possess a special structure, but the general results allow it to 
be arbitrary. These results cover both one-dimensional and multidimensional domains T. 

As a particular example we consider the case that X = [0,1] and is a standard 
Brownian motion. In this case is a mean-zero Gaussian process with covariance func¬ 
tion FjW/W/ = c{s At), and can also be obtained by taking a standard Brownian motion 
on the transformed time scale ct. More generally, for every self-similar process of order 
a the process {^/cWl : t > 0) is equal in distribution to {W^^i/{2a) '■ t > Q) and hence our 
present sense of scaling is equivalent to changing the length scale of the standard process. 
This applies in particular to multifold integrals (indefinite integrals) of Brownian motion, as 
considered in [12] in connection to spline smoothing. 

For a given scale c the Bayesian model is then described by 


/ I c ~ IT^ 

kn|/,C~ fn = (/(xp^), . . . ,/(Xn,„))^. 


( 1 . 2 ) 


The posterior distribution given c is by definition the conditional distribution of / given (1^, c) 
in this setup. As Yn depends on / only through fn, the conditional distribution of / given 
{Yn,fn,c) does not depend on the data Yn and is the same as the conditional distribution 
of / given (/n,c), which is determined by the prior only. Thus we focus on the posterior 
distribution of fn, which by standard Gaussian calculus can be seen to satisfy 


fn\Yn,C^J\fn {fn,c, I " , fn,c = {I - 


— I cUn 


(1.3) 


for Un the covariance matrix of the unit scale process restricted to the design points Xi^n- 
For instance, for scaled Brownian motion {Un)i,j = Xi^n A Xj^n- 

If Yn follows the model (1.1) with a continuous function /, then for fixed c the posterior 
mean fn,c tends to /„ and the posterior covariance matrix I — tends to zero as n —>■ oo 
(see [5, 20]). This remains true if c = is made dependent on n and allowed to tend to 
zero or infinity at polynomial rates. Thus the posterior distribution given c = Cn contracts 
to the Dirac measure at / for reasonable Cn- The rate of contraction depends on Cn and the 
regularity of the function / jointly. A smaller value of c corresponds to less variability in the 
prior process, and yields a posterior distribution with a less variable mean function and a 
smaller covariance. This is advantageous if the true regression function / is fairly regular, but 
will lead to a suboptimal contraction rate and a too optimistic quantification of remaining 
uncertainty in the opposite case (see [19, 16]). It is therefore important to adapt c to the 
data. We discuss three methods, which turn out to have similar behaviour, both in terms of 
contraction rate and uncertainty quantification, although the sets of functions for which they 
work differ. 

In the hierarchical Bayes setup the parameter c is equipped with a prior, and an ordinary 
Bayesian analysis is carried out with the resulting mixture of normals prior for /. We shall 
consider the situation that c follows an inverse Gamma distribution. 

In the empirical Bayes setup an estimator Cn of the length scale is plugged into the posterior 
distribution for given c. We consider two methods of estimation: a likelihood-based and a risk- 
based method. 

The likelihood-based empirical Bayes method defines Cn as the maximum likelihood esti¬ 
mator of c within the marginal Bayesian model Yn\c ^ AA(0, Y,n,c), which follows from (1.2). 
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In this marginal model c is the only parameter, and its maximum likelihood estimator is 


Cn = argmm 

C&In 


log det S 


n,c 


I yTy-ly 

^ ^n,c-^rL 


(1.4) 


The restriction of c to an interval In away from the extremes 0 and oo is convenient. Through¬ 
out the paper we shall use 

In = [logn/n,n”'"^], 

where m is chosen large enough so that the minimax scaling rates for all smoothness levels are 
included. (If (1.12) holds, then it is chosen equal to the m in this equation.) The likelihood- 
based empirical Bayes procedure ought to be close to the hierarchical Bayes procedure, as 
the posterior density for c is proportional to the marginal density of Yn given c times the 
prior density by Bayes’s rule, and hence ought to concentrate around c„ in (1.4). Thus the 
posterior distribution with a likelihood-based empirical Bayes plug-in for the scale parameter 
is sometimes viewed a computationally cheaper version of a true Bayesian analysis. 

The risk-based empirical Bayes method uses an alternative estimator for c that tries to 
minimize the risk of the posterior mean fn^c-, which is given by 

E/||/n,c - fnf = II - Klfnf + tr((/ - (1.5) 


The first term on the right depends on the unknown function /, and hence cannot be used in 
a criterion to estimate c. An obvious estimate for this term is || — but it is biased, 


as 


E, 


— 

^n.c-'n 


= iiE-'/; 


+ E/ 


\y~^r 


= l|E-'/„ 


+ ll'(En,c)- 


This motivates the estimator for c given by 


Cn = argmm 

C&In 


tr((/ - s-|,) 2 ) - tr(S- 2 ) + 


( 1 . 6 ) 


In the special case that is an (m — l)-fold integral of Brownian motion, this estimator 
was introduced in the context of regression by spline-smoothing. The posterior mean in our 
setup is then equal to a penalized least squares estimator for the penalty A f dx, with 

smoothing parameter A equal to l/(cn). See [ 22 , 5]. 

In Bayesian inference the posterior distribution is used both to reconstruct the regression 
function /, typically by the posterior mean, and to quantify the uncertainty in this construc¬ 
tion, using the spread of the posterior distribution. In this paper we are interested in the 
accuracy of these procedures within the so-called frequentist setup, which assumes that the 
data Yn are generated according to model (1.1) for a given “true function” /. The accuracy 
of the posterior mean as a point estimator of / can be measured by its risk function or the 
contraction rate of the full posterior distribution (see [ 8 ]), as usual. The accuracy of the un¬ 
certainty quantification can be studied through the coverage and size of credible sets, which 
are data-dependent sets of prescribed posterior probability. In connection to the empirical 
Bayes methods we shall first study credible sets of the form 

Cn,ri,M = {/ : Wfn “ /n,c„|| < Mr„(Cn,77)}, (1.7) 

with II • II the Euclidean norm. Here rn{c,r]) is determined, for given 77 G (0,1), such that the 
ball of radius rn{c,r]) centered at the origin receives probability p under the posterior law of 
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fn — fn,c given a fixed c, which by (1.3) is the normal law A/’n(0, 1 — In the hierarchical 

Bayes setup we first select a pair of (nontrivial) quantiles ci^nitli) < C 2 ,n(f?i) in the posterior 
distribution of c, the distribution of c 1in the Bayesian model (1.2) augmented with a prior 
on c. We then consider as credible sets for /: 

Cn,r,,M = U {f -Wfn- fn,c\\ < (1-8) 

Cl,n(Vl)<C<C2,n{Vl) 

This two-step construction can exploit that the credible sets for fixed c have a simple de¬ 
scription through the radii r„(c, r/). An alternative would be a ball around the hierarchical 
posterior mean f fn^c^n{dc\Yn). 

The uncertainty quantification, by either (1.7) or (1.8), is deemed accurate if the sets Cn,ri,M 
cover the true parameter / with high probability, if the data are generated according to the 
model (1.1). In particular, the credible sets are honest confidence sets at level r] for a given 
class of functions J- if 

inf P/(/ e Cn,^,M) > V- 

J 

The number rn{c,rj) is the natural radius of the credible set for fixed c at level r] in the 
Bayesian framework. The additional constant M in the definitions (1.7)-(1.8) of the credible 
sets is required because the Bayesian and frequentist notions of coverage are not the same, 
and c is estimated. 

It is well known that the size of an honest confidence set for a given model J- is determined 
by “worst case” members of J- [14, 11, 3, 4, 15, 7, 10]. For instance, if J- contains a Holder 
ball of regularity a, then the (random) diameter of the confidence set cannot be of smaller 
order than even if the true function is much smoother. In other words, the 

size of honest confidence sets cannot adapt to the unknown smoothness of the true regression 
function. On the other hand, the posterior contraction rate of the hierarchical Bayes method 
is known to adapt to unknown regularity, in that the rate is faster if the true function is 
smoother. We show below that the empirical Bayes methods adapt in a similar manner. Since 
the corresponding credible sets will have diameter of order the contraction rate, it follows that 
these sets cannot be honest over a “full” set of functions, such as a Holder ball. Following 
[9, 1, 18] we lower our expectation and investigate honesty over a reduced parameter space, 
with certain “inconvenient” true parameters cut out, as follows. 

The distribution of the data depends on the function / only through the vector /„. A 
convenient way to describe this vector is through its coordinates relative to the eigenbasis of 
the covariance matrix Un- Write /i,n) ■ ■ ■ , fn,n for the coordinates of fn relative to this basis, 
i.e. 

fj,n ■— fn ^j,ni j £ {Ij ■ ■ ■ j tl} , 

for ei^n,..., Cn^n the orthonormal eigenbasis of Un- Let Ai^n,..., An,n be the corresponding 
eigenvalues. 

Definition 1 (Discrete polished tail). We say that the function /, or the corresponding array 
{fj,n), satisfies the polished tail condition if there exist constants L and p such that for all 
c > 0 and sufficiently large n it holds that 

r E /A £ E fU 
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The condition may be paraphrased as requiring that the “energy” of the signal / in the 
“large frequencies” {j : 1 < cXj^n < p} is at least a fraction L~^ of the “energy” in the 
“frequencies” {j : cXj^n < !}• Perhaps a better name would be “self-similar”, but this name is 
already taken in the literature for a more special property. The following example shows that 
the condition is similar to the polished tail condition introduced in [18] when the eigenvalues 
decrease polynomially in j. 

Example 2 (Polynomial eigenvalues). If Xj^n ^ for some constants and k > 0, 

then the discrete polished tail condition is equivalent to the existence of constants L and p 
such that, for all sufficiently large m (and hence sufficiently large n), 

n pmAn 

E /b S i E tin- 

j=m j=^ 

Indeed, the condition cXj^n < 1 is equivalent to j > {cKnY/^ =: J, whence the right side of 
(1.9) is bounded above by '/2j>jfjn^ which is bounded above by LYlij<j<jpf‘jn t’Y (ITO). 
This is the left side of (1.9), with p~^ instead of p. 

In [18] a condition similar to (1.10) is introduced in a continuous time setup. We comment 
on the relationship of these conditions in Section 4. 

The main result of this paper is that all three types of credible sets are honest confidence 
sets over polished tail parameters, of diameter that adapts to the smoothness of /. We measure 
smoothness through the square norms, for a > 0, 

ll/lln,a = 

^ = 1 ( 1 . 11 ) 

ll/lln,a,oo = E 


These norms are in terms of the restriction of / to the grid (xj^n)- We comment on their 
relationship to norms on the full function / in Section 4. (In general the coefficients fj^n 
cannot be directly related to an infinite sequence of Fourier coefficients of /, but for many 
functions the numbers fj,n/V^j which include the scaling factor ^/n, is close to the Fourier 
coefficient.) 

In the following theorem we assume that there exist constants 0 < J < <5 < oo and m > 1 
such that the eigenvalues Ai^„,..., Xn^n of Un satisfy 





— ^j,n ^ 



( 1 . 12 ) 


Since Wn is distributed as Xj^nZj^n&j,n for i.i.d. standard normal random variables 

Zj^ni we have E||IT||^„ = For the eigenvalues (1.12) this is uniformly 

bounded if and only if a < (m — l)/2. Thus these eigenvalues correspond to modelling the 
regression function a-priori as “almost {m — 1)/2-smooth”. 

Let J-n,L be the set of all functions that satisfy the discrete polished tail condition (1.10) for 
given L and satisfy fj,n — dn for some sufficiently small constant d (that may depend 

on 5 and m). 
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Theorem 3. Assume that (1-12) holds. For sufficiently large M and any rj > 0 the credible 
sets (1.7), with Cn given by (1-4) or (1.6), and the credible sets (1.8) satisfy 

inf Pfif £ Cn,rj,M) 1- 

f&.rn,L 

Furthermore, for any a G (0,m/2), the diameter of the credible sets Cn,p,M relative to the 
scaled Euclidean norm || • is of the order Op^ uniformly in f with ||/||n,a ^ 1 

or ||/||n,a,oo ^ 1- For the risk-based empirical Bayes method this is even true for a G (0,m). 

The theorem is a summary of the main results of the paper as valid for all three methods. 
More specific results for the individual methods, with relaxations of the polished tail condition 
tailored to the specific method, as well as results that do not assume the eigenvalue condition 
(1.12), are described below. For example, these results cover functions / on a two-dimensional 
domain with eigenvalues of the forms (1.19) or (1.20), as introduced below. 

The second and third assertions of the theorem show that the diameter of the credible sets 
adapts to the regularity of the true regression function. The restrictions to regularity levels 
a < m /2 or a < min the likelihood-based and risk-based methods stem from the prior, 
through the rate of decrease (1-12) of its eigenvalues, and the method used. The range (0,m) 
is bigger than could be expected from the existing literature on Gaussian process priors. For 
instance, (m/2 — l)-fold integrated Brownian motion satisfies (1.12) and has sample paths 
of regularity m/2 — 1/2. It has been documented to be an appropriate prior for functions 
of exactly regularity m/2 — 1/2, and to become appropriate for functions of regularities a G 
(0, m/2] after appropriate (deterministic) scaling [16, 19, 13]. The latter property is retained 
under random scaling by likelihood-based empirical Bayes and hierarchical Bayes methods 
considered in the present context (although for a = m /2 an extra logarithmic factor may 
come in; see Example 23; the definitions of regularity in the various papers are also not 
directly comparable). Surprisingly the risk-based method performs better than the likelihood- 
based methods, in that it enlarges the good range to a G (0,m). This is caused by the closer 
connection of the risk-based empirical Bayes method to the diameter of the credible set, 
yielding a more appropriate scaling factor Cn for minimizing this diameter. 

The diameter of the credible sets is linked to the posterior contraction rate. The rates 
are attained irrespective of / satisfying the polished tail condition, the latter 
condition being important only for the coverage. 

The credible sets (1.7) and (1.8) are obtained by considering balls in the space of function 
values of / at the design points. An alternative are (sets based on) pointwise intervals of the 
form 

Cn,r,,M{x) = {f :\f{x) - fn,Cr,{x)\ < Mr„(c„, 7?, x)} 

or 

Cn,ri,M{x) = IJ {/ : \ f{x) - fn,c{x)\ < Mr„(c, 772 , x)}, 

ci,Ti(?7l)<c<C2,n(r7l) 

where /n,c(x) denotes the mean of the marginal posterior distribution of f{x) given c and 
rn{c,T],x) is determined so that 

P{\f{x) - fn,c{x)\ < rn{c, T], x) | c) = T]. 

Since this marginal posterior distribution of /(x) given c is normal with mean fn^dx), these 
intervals are easily determined. In particular, for a design point x = Xi^n the radius rn{c,ri,x) 


(1.13) 

(1.14) 
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is equal to z^(l — for Zrj the (l + r/)/2-quantile of the standard normal distribution. 

When used simultaneously for multiple values of x, these intervals form a credible band. 

The study of the coverage of such pointwise intervals and bands requires different techniques 
from those in the present paper, and appears to be tractable only for concretely specified prior 
processes. However, the methods developed here are suitable when measuring coverage in an 
averaged fashion that focuses on the fraction of the design points at which the intervals (1.13) 
or (1.14) cover the true function. A similar point of view was taken by [22, 2]. The following 
corollary gives such a result for a subset of design points Xi^n that are spread evenly relative 
to the prior process. More precisely, let 


sl(c,Xin)-= inf 

aSR" 


'cE{Wl^-a^W^f + \\a\\ 


denote the posterior variance at the design point Xi^n and set 


Jn • — 



(1.15) 


for some constant C that is independent of n. Then the corollary holds when considering the 
design points in this set. 

In Corollary 3.6 of [16], we have seen that Brownian motion satisfies this condition for the 
set of all design points that satisfy Xi^n > C/y/\ogn. 

The following corollary shows that the uncertainty quantification through the intervals 
Cn,ri,M{xi,n) IS correct at the design points in the set as long this set is large enough, 
except possibly a fraction. 

Corollary 4. Assume that (1.12) holds and that the set J„ given in (1.15) satisfies \ Jn\ ~ n. 
Fix 7€(0,1), r/>0 and let Cn be given by (1./) or (1.6). Then for sufficiently large M the 
credible sets defined in either (1.13) or (l.lf) satisfy 

inf Pf(- l{/ e Cn,n,M{xi,n)} > 7 ) 1- 

n,L ^Tl / 


Furthermore, if for i C Jn it also holds that s‘(^{c,Xi^n) < ^xj^n) for some 
C > 0, then for any a € (0, m/2) the length of the intervals Cn,r],M{xi,n) is of the order 
uniformly in i G Jn, uniformly in f with \\f\\n,a < 1 or ||/||n,Q,oo < 1- For 
the risk-based empirical Bayes method this is even true for a G (0,m). 

The proof of this corollary can be found in Section 7. 

The multiplicative constant n in (1.12) is motivated by comparison with the continuous 
time setup. If the covariance function K{s,t) = EW}W( of the continuous time process 
has eigenfunctions ej satisfying 


J K{s,f)ej{f) ds = XjCj^s), 


then for equidistant design points one may expect that 

n 

i=l 
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This suggests both that Xj^n ~ nXj and that the “discrete” eigenvectors ej^n shonld be close 
to the eigenfunctions restricted to the design points. This is a suggestion only, which already 
makes little sense when counting the nnmbers of eigenvalues involved: n versus oo. Neverthe¬ 
less, for the Brownian motion prior the correspondence is exact. 

Example 5 (Brownian motion). The Brownian motion prior permits explicit formnlas for 
eigenbasis and eigenvalues, provided the design points are taken eqnal to Xi^n = i/{n + 1/2) 
for i G {1,... ,n}, a slight shift from the usual uniform grid. The formnlas are interesting as 
they allow to make a connection to the Fourier basis (see Section 4). 

The eigenvectors of the covariance matrix Un of standard Brownian motion, scaled to unit 
length, are given by, for j G {1,... , n}, 

^ 3 ,n = / ^ , {ej{xi,n), ■ ■ ■ ,ej{xn,n))'^, ej{x) = V2sin[(j - Dttx]. (1.16) 

i/n -h 1/2 

The functions ej are an orthonormal basis of {/ G ^ 2 ( 0 ,1] : /(O) = 0}, and happen to be eigen- 
fnnctions of the covariance kernel of continnons Brownian motion. A similar correspondence 
is valid for Brownian bridge, bnt we are not aware of other examples where the continuous 
and discrete setnps match up so closely. 

The eigenvalues of Un are given by 


(4n-|-2) sin^((j — l/2)7r/(2re-|-1)) 


As the argnment of the sine is in [0,7r/2], for which 2x/Tr < sinx < x, there exist nnmbers 
{6, 6) such that 


n 1 /2n -|-1 

- j2 — _j_ 2)7 j- 2 V j _ 1 


2 

^ ^j,n ^ 


1 /2n + 1 

16n + 8 V j - 1 



(1.17) 


where this inequality holds for all n and j > 1 if we take (5, 6) = (vr“^, 3), and for j > 2 and 
n sufficiently large if we let <5 = 4/10. 

Standard Brownian motion has sample paths of regularity 1/2, and has been documented to 
become an appropriate prior for fnnctions of regularities a G (0,1) after appropriate scaling 
[16, 19, 13]. We show in the present paper that the good range is enlarged to a G (0,2) 
provided that the scaling by the risk-based empirical Bayes method is used. 

Example 6 (Discrete priors). Although it often helps intnition to model a function / a- 
priori by a Gaussian process on a “continuous” space that encompasses the design points, 
nothing in the preceding setup requires this. In fact, we may tnrn the constrnction aronnd, 
by starting with an arbitrary orthonormal basis ei^„,..., en,n and eigenvalues ..., \n,n-, 
and next define the prior covariance matrix Un to be the matrix that has this as its eigenbasis 
and eigenvalnes, that is, its spectral decomposition is 


Un 


n 

i=l 


(1.18) 


Given arbitrary points xi^„,..., Xn,n the vector /„ is then a-priori modelled by its coefficients 
fi^n relative to ei^n, ■ ■ ■, en,n, which are independent A/(0, cAj^„)-variables. 
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One particular example is to retain the eigenvectors of Brownian motion, but to change the 
corresponding eigenvalues to (1-12) for a general m. The interpretation of the norms || • \\n,a 
and II • ||n,a,oo would be the same as for Brownian motion (as discussed in Section 4), but the 
good rates relative to these norms would now be attained for a up to m (or m/2) rather than 
2 (or 1). Our theoretical results show only advantages to taking a larger value of m, but one 
might guess that a deeper analysis could change this picture. 

Example 7 (Discrete Laplacian). The discrete Laplacian is a useful tool to construct “smooth 
priors” on a discrete set of design points. For a univariate grid it is closely connected to the 
Brownian motion prior of Example 5. For a countable set fh equipped with a neighbourhood 
relation ~ the Laplacian is the operator acting on functions f : X ^ M, defined by 


y.y^x 


Small values of |L/| indicate that / changes little across its neighbourhoods, whence L can 
be used to model smoothness relative to the given neighbourhood structure. 

Identihcation of a function / : T —)• M with the infinite vector (/(x) : x G T) gives an 
identihcation of L with an infinite matrix (with (x, element equal to 1 if y 7 ^ x and y ~ x; 
equal to —#{y ~ x} if y = x; and equal to 0 otherwise). The restriction of this matrix to 
the rows x G {xi^„,..., Xn,n} will have nonzero elements in columns y ^ ■ ■ ■ j Xn,n} with 

y ~ Xi^n for some i, and hence a restriction of Lf to the design points may not correspond to 
simply taking the appropriate in x n)-submatrix of L. This is typically solved by imposing 
boundary conditions, much as when considering a continuous partial differential operator. 

In the example of T" = Z with the design points xi^„,..., Xn,n identified with the points 
1,..., n and the neighbourhood system: i ~ j if and only if |i — j| = 1, the discrete Laplacian 
is 

L{f){i)= Y1 [/OO-ZW] =/(* + !)+ /(*-!)-2/(i). 

The restrictrion of L{f) to the design points 1,... ,n also involves the points 0 and n + 1, 
and there are various ways of imposing boundary conditions. The natural choice /(O) = 
/(n+1) = 0 is known as the Dirichlet boundary, while the other natural choice /(O) = /(I) and 
f{n + 1) = /(n) is the Neumann boundary. The eigenvectors and eigenvalues corresponding 
to these boundary conditions are known explicitly, and so they are for the mixed Dirichlet- 
Neumann conditions: /(O) = 0 and /(n + 1) = /(n). In fact, in the latter case the eigenvectors 
are exactly equal to ej^n as given in (1.16) and the eigenvalues are —l/((n+l/2)Aj,n) for Aj,n as 
given in (1.17). This close connection to Brownian motion is not obvious, but also not entirely 
surprising as minus the inverse Laplacian (the twofold primitive) is the covariance operator of 
Brownian motion (restricted to the orthocomplement of the constant functions) and standard 
Brownian motion is tied at zero. The connection invites to interpret the eigenvectors (1.16) 
as modelling smoothness in a discrete sense, an interpretation that also makes sense if the 
design points Xi^n are linearly ordered and roughly equally spaced, but not exactly equal to 
i/{n + 1/2) as in Example 5. For the special grid of the latter example the norm in (1.11) 
corresponds exactly to the size measured by the Laplacian, in that 


1 

n 




E 

i=l 


/, 




((n + l/2)Aj^n) 
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(The norm on the left side is the Euclidean norm of ffi” and the leading factor 1/n stabilizes the 
sum involved in this norm; the factor preceding L corresponds to for h ^ 1/n the mesh 
width of the grid.) Although the eigenvalues (1-17) come naturally with the discrete Laplacian, 
when defining the prior they might be replaced by eigenvalues (1-12) for a general m. This 
would correspond to describing a-priori smoothness by a power of the Laplacian. Indeed, as 
noted following (1.12), for these eigenvalues we have E||1 T||^q, < oo for a < (m — l)/2. In 
view of the preceding display, this is equivalent to finiteness of ^ E||(n^L)“iy„|p. So the prior 
with covariance matrix (1.18), for eigenvalues ( 1 - 12 ) and eigenvectors (1.16), corresponds 
to modelling / by a Gaussian process W with finite discrete Laplacian {n‘^L°‘)W for a < 
(m - l)/ 2 . 

Example 8 (Integrated Brownian motion). Once integrated Brownian motion W/ = 
f^Bgds, for B standard Browian motion, possesses covariance function coy{WI,W/) = 
s^{2>t — s )/6 for s <t. The eigenfunctions are given by 

ej{t) oc (sin 0 j -|- sinh 0 j) (cos(t 0 j) — cosh(f 0 j)) — (cos 0 j + cosh 0 j) (sin(t 0 j) — sinh(f 6 *j)), 

where the 0 j are the positive roots of the equation cos( 0 ) cosh( 0 ) = — 1 , for j G { 1 , 2 ,...}. 
ee [ 6 ], Theorem 7. The corresponding eigenvalues are Aj = and are of the order ((2J — 
l)7r/2)-l 

Thus this example appears to satisfy (1.12) with m = 4. However, exact expressions for 
the discrete eigenvectors and eigenvalues appear not known. 

Example 9 (Two-dimensional Brownian motion and variants). Functions / : [0,1]^ —M on 
the unit square may be modelled a-priori by a scaling of two-dimensional Brownian motion 
= (W/f : (s,t) G [0,1]^), which is the tensor product = Bi^sB 2 ,t of two independent 
standard univariate Brownian motions Bi and H 2 . The covariance function is the 

tensor product K{s, s')K{t,t') of the covariance functions K{s,s') = s A s' of the univariate 
Brownian motions. For a rectangular grid consisting of points (xi^n, Xj,n) constructed from a 
given univariate grid 0 < xi „ < • • • < Xn,n < 1) the covariance matrix of the n^-dimensional 
vector {Wxi (lj) ^ {Ij • • • with its coordinates ordered appropriately, is the 

Kronecker product of two copies of the covariance matrix of the n-dimensional vector {Bx^ ^) ■ 
The eigenvectors are the tensor products Ci^n G) of the univariate eigenvectors with 
corresponding eigenvalues the products Aij^n = Ai^nAj^n of the univariate eigenvalues 

Even though in this case the eigenfunctions and eigenvalues are more naturally viewed as 
a two-dimensional array than a sequence, they may of course be ordered in a sequence. Then 
this example fits the general setup, except that n has been changed into n^. 

In particular, for the grid in Example 5 the eigenvectors are the discretisations of the tensor 
products of the sine-basis given in (1.16) and the eigenvalues satisfy 

2 

^ ^rnjm ’ (b j) ^ • • • ! (1-19) 

for m = 2. Theorem 3, which assumes (1.12), does not apply to this example. However, the 
assumptions of the general results below are satisfied, also for a general value of m > 1 , 
and hence the message of the theorem goes through. The set of polished tail functions can 
be defined in the same manner by ( 1 . 10 ), after ordering the array of coefficients fij,n in a 
sequence by order of decreasing eigenvalues Aij^n (that is, increasing values of ij). 
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The square smoothness norm || • \\n,a as in (1.11) now becomes n~‘^ fij n- 

While the eigenbasis is essentially the natural two-dimensional Fourier basis, the restriction 
imposed by this norm is a bit unusual, in its focus on the cross product ij. As the smoothness 
norm describes the prior process, this may be unsatisfactory. More natural “Sobolev norms” 
X^ILi fij n correspond to the eigenvalues 


n 




(i2 j 


2\m ’ 


{i,j) G {l,...,n}^ 


( 1 . 20 ) 


The Gaussian process corresponding to these eigenvalues has E||iy ^||^2 ^ < oo for every 
a < m — 1, and hence may be considered “Sobolev smooth almost of order m — 1”. 

For these eigenvalues the discrete polished tail condition (1.9) can be written in the form 




t,j^n — ^ 


EE h 




2=1 j = l 2 = 1 j = l 


for sufficiently large m. The theorems below show that the credible sets corresponding to this 
prior cover functions that satisfy this condition. 


1.1. Organization of the paper 

The paper is structured as follows. In Section 2 we analyse the estimators Cn of c and next 
prove our main results about the coverage of the empirical Bayes credible sets (Section 2.1). 
We follow up with results about contraction rates of oracle type and over various concrete 
models (Section 2.2). In Section 3 we study the hierarchical Bayes method, starting with the 
concentration of the posterior distribution of the scaling parameter c and next using this to 
determine coverage and contraction. Section 4 concerns the interpretation of the polished tail 
condition, which is related to a similar condition on the Fourier coefficients of /. It is shown to 
be satisfied with probability one under the prior. This section also discusses various alternative 
smoothness assumptions on the function /. Section 5 is a closing discussion, which addresses 
conditions, interpretations, and generalizations of our results. Finally Sections 7 and 8 gather 
technical proofs and technical lemmas. 


1.2. Notation 

The notation an x bn means that anjbn is bounded away from 0 and infinity, as n —>■ oo, and 
On ~ bn means that anjbn tends to I. If an and bn are functions, then we say that an ^ bn or 
ctn ~ bn uniformly over a domain if the constants away from 0 and inhnity can be chosen the 
same for every value in the domain, or the convergence to 1 is uniform. 

The notation o < 6 means a < Cb for a universal constant C. 

For a function g : X ^ W, the vector (^g{xi^n), ■ ■ ■ ,g{xn,n)) is denoted by gn- The same 
notational device is used for a vector Sn composed of variables ei^n-, ■ ■ ■ Tn,n. 

Unless stated otherwise the set In is the interval In = [log n/n, 
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2. Empirical Bayes 


By substituting the model equation Yn = fn + £n, we can decompose the quadratic forms in 
the empirical Bayes criteria (1.4) and (1.6) as 

yj = /T Y/% + + 2 /T A: e {i, 2 }. ( 2 . 1 ) 

We next express both /„ and Sn relative to the orthonormal eigenbasis ei^„,... of Un- 
The coefficients of /„ are by their definition the numbers fj^n, while the coefficients of £n are 
i.i.d. standard normal variables Zj^n- The matrix Tjn,c = I + cUn and its inverses and 
have the same eigenbasis as Un, with eigenvalues (1 + cXj^n), (1 + cXj^n)~^ and (1 + 
respectively, for Xj^n the eigenvalues of Un- It follows that the two types of empirical Bayes 
estimators Cn minimize criteria and of the form 

Ln{c, f) ■■= T>l,n(c, /) + T' 2 ,n(c) + Rl,n{c, f) + i? 2 ,n(c) (2.2) 

= DnicJ) + RnicJ). 


For the risk-based empirical Bayes estimator (1.6) the functions and processes Di^n, D 2 ^n,Ri,\ 
and i? 2 ,n on the right side are defined by 


/ 


<n(c,/) = /TS-2/; = ^ 

y “T 

D 2 ,n{c) = tr((/ - B J)2) = ^ 


R^^nic, f) = 2/TB-2e-; = 2 


(2.3) 


Rln{c) = fnZ^/lsn " HY/D - ^{Zln ” 1 ) = 

i=i i=i 


- (1 -t- cXj^n)^ 


-1 


whereas for the likelihood-based empirical Bayes estimator (1.4) these functions and processes 
are given by 


Df^n{c,f) = fl^n!cfn = Yl 


-11 


J,n 


i=i 


1 -|- cA 


‘3,n 


cXi 


^ 2 ,n(c) = logdet Yn,c - tr(/ - ^ [log(l -h cAj>) - 

j = l J, 


i?f,Jc,/) = 2/lB->-; = 2^ 


i=i 


Zj,nfj,r, 
1 -|- cA 


(2.4) 


■3,r- 


Rinic) = €^n!c^n “ tr(B„_^J - ^(Z^^ “ 1) = “ X] 


(Z? - l)cA,-, 


i=i 


i=i 


1 -|- cA 


'J,n 


In general discussions we shall leave off the superscripts R and L, for “Risk” and “Likelihood”, 
and denote both the risk- and likelihood-based functions by Di^n,D 2 ,n,Ri,n,R 2 ,n- In both 
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cases we have shifted the criteria by the factor n ~ 1)5 which does not depend on c, 

in order that the remainder term R 2 ,n be smaller. 

The functions Di^n and T* 2 ,n are deterministic, whereas Ri^n and i? 2 ,n are random pro¬ 
cesses. The processes Di ^ and depend on /, whereas the other processes are free of the 
parameter. Even though the functions and processes differ in the risk- and likelihood-based 
cases, for instance by the power of 1 -|- in the denominators, the two estimators c„ can 
be analysed by similar methods. In Lemma 14 it will be seen that under (1.12) the two func¬ 
tions L> 2 ,n) even though quite different in form, are asymptotically equivalent. The following 
proposition shows that in both cases the stochastic process Rn is negligible relative to the 
deterministic process Dn- 

Proposition 10. If (1.12), (1-19) or (1.20) holds, then for Ri^n ctnd i? 2 ,n o>s given in (2.3) 
or (2./) and the corresponding Dn = Di^n + D 2 ^n in the same display it holds that 


\R,,n{cJ)\ + \R2,nic)\ 
DnicJ) 



(2.5) 


The proof of the proposition can be found in Section 7. In case of the eigenvalues (1-19) or 
(1.20), it should be understood that n is replaced by in the assertion (and the single sums 
in (2.3) or (2.4) by double sums). 

We may view the stochastic process Rn = Ri,n + R 2 ,n in (2-3) or (2.4) as an “estimation 
error” when estimating an “ideal” criterion Dn = Di^n + .C* 2 ,n- The preceding proposition 
essentially says that this error can be ignored. As a consequence the minimizer Cn of = Dn+ 
Rn will behave similarly to the (deterministic) minimizer of Dn- The latter functions consists 
of a part Di^ni', f) that is decreasing in c, from Di^„(0, /) = f‘jn Di^n{oo, f) = 0, and 

a part 712,n that is free of / and is strictly increasing in c, from il 2 ,n( 0 ) = 0 to 7l2,n(oo) > n. 
Minimizing the sum Dn of these functions can be viewed as an attempt to balance these two 
terms. 

In the case of the risk-based empirical Bayes method Di^nic, f) is exactly the square bias 
of the posterior mean at the true regression function /, given a fixed scale c, and il 2 ,n(c) is its 
variance, which is independent of / (see (1.5)). The square bias is decreasing in the scale c, 
while the variance is increasing, and hence the empirical Bayes estimator tries to balance 
the square bias and variance by minimizing an estimate of their sum. The likelihood-based 
empirical Bayes estimator is not as strongly tied to the risk, but we shall see that it performs 
in a similar manner. Here the essence will be that its bias term Di^n is bigger than the bias 
term of the risk-based method, while its variance term has the same order of magnitude. 

For minimizing the risk the empirical Bayes methods always do the right thing. However, 
the coverage of the credible sets depends not on the sum of square bias and variance, but on 
their relationship, or rather the relationship between square bias and the posterior variance 


4(c) = E(||/; - fn,cf I Yn,c) = tr(/ - S4) = 


cA 




i=i 


1 “t" cXj^n 


( 2 . 6 ) 


If for a particular / the square bias exceeds the posterior variance, then the empirical Bayes 
method will put a too narrow credible set too far from the truth, which it will not cover in 
that case. The posterior variance, although not equal to the variance terms D 2 ^m has the same 
order of magnitude as these quantities (see Lemma 14). Thus a lack of coverage is caused 
by too small a value of Cn, giving too small a prior variance and posterior variance, i.e. by 
“oversmoothing” the truth. 
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Notwithstanding the nice properties of the functions Di^n and D 2 ,n for a given n, such over¬ 
smoothing may occur for / for which the “bias” function c i—)• Di nic, f) changes haphazardly 
with n. (We describe this here in an asymptotic framework, with n —>■ oo, but a problem will 
arise for every given n, albeit possibly for different /.) The point is that at different sample 
sizes, different aspects of / determine the behaviour of the empirical Bayes estimators c„. The 
assumption that / satisfies the polished tail condition prevents such haphazard behaviour for 
both empirical Bayes methods. When considering a given method, good behaviour can also 
be more precisely characterised through the corresponding function Di^n, as follows. 

Definition 11 (Good bias condition). We say that the function /, or the corresponding array 
{fj,n), satisfies the good bias condition relative to Di^n if there exists a constant a > 0 such 
that, for c C In, 

Di,n{Kc, /) < K-“Di,„(c, /), for all K > 1. (2.7) 

As a pendant to this condition we call D 2 ^n good variance functions if there exist constants 
b,B,B > 0, independent of n, such that for c £ In we have 


Bk^D 2 ^n{c) < D 2 ^n{kc) < B'k^D 2 ^n{c) for all k < 1. (2.8) 


Since the functions D 2 ^n do not depend on /, the good variance condition merely refers to 
the prior process. Priors satisfying (1.12) give D 2 ^n{c) ^ (cn)^/"* (see Lemma 14) and hence 
yield good variance functions with b = 1/m. 

The essence of these “good conditions” is captured in the purely analytical Lemma 42 in 
Section 8, which is the basis of the proof of the second assertion of the following theorem. 

Theorem 12. Suppose the remainder terms Ri^n o,nd 722,n satisfy (2.5). Then for any f and 
£ > 0 the empirical Bayes estimators Cn given in (l.f) and (1.6), with the corresponding 
function Dn = Di^n + D 2 ,n 0 ‘S given in (2.3) and (2./), satisfy 

Pf(Dn{cn,f) < (1 + e) inf L>„(c,/)) 1. 

V ce/n / 


Furthermore, if f satisfies the good bias condition with constant a, D 2 ^n ore good variance 
functions with constants b,B,B' and fjn — ®'^Pce/n -^2,n(c), then also 

P/(71l,n(Cn,/) < B-\2 + 2£)^+^/^D2,niCn)) ^ 1. 

Proof. Let Cn G In be a minimizer of Dn and set A„ = {c G In : Dn{c, f) < (1 + £)Dn{cn, /)}• 
For the first assertion it suffices to show that Pf{cn G A„) —)• 1. By the definition of Cn, this 
is the case if inf(,^A„ Ln{c, f) is with probability tending to one strictly bigger than Ln{cn, /)• 
Since = Dn + Rn, relation (2.5) gives 


inf Ln{c,f) 


= inf 


C^An 


Dn{cJ){l + 


RnjcJ) 

DnicJ) 


(1 — sup 

Rn{c,f) 

)> 

inf DnicJ) 


Dn{cJ) 

/ 

Lc^An J 


op(l)) 


By the definition of An the infimum on the right side is at least (1 -|- £)Dn{cn, /)• Moreover, 
again by Proposition 10 we have that Ln{cn, f) < Il„(c„,/)(l -|- op(l)). The desired result 
follows, as Dn{cn, f) is strictly positive. 
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For the proof of the second assertion we define Cn as the unique point of intersection of the 
graphs of the functions Di ^ and -D 2 ,n) i-e. the unique solution of the equation /) = 

D 2 ,n{c)- If Cn G In, then by the first assertion Dn{cn, f) < (1 + e)Dn{cn, f), whence the 
assertion follows from Lemma 42(i). If falls to the left of I„, then Di „(c,/) < -D 2 ,n(c) 
throughout In by the monotonicity of the two functions and the assertion is trivially true. 
The assumption that Din{0,f) = YlJ=ifjn below the maximum value of Ll 2 ,n prevents 
that Cn falls to the right of In- D 


The good-bias condition on / is dependent on the prior and the method through the 
function Di^n, which can be or For both methods the condition is implied by the 

discrete polished tail condition. 


Lemma 13. Any f that satisfies the discrete polished tail condition also satisfies the good 
bias condition, for both the risk-based and likelihood-based bias functions Di^n{',f)- 


Proof. If / satisfies the discrete polished tail condition, then 


,E 

j ^ 1 




(1 “t“ 


< E /l»sr E /I»S4L x: (irf-F 


since 1 -|- cXj^n E 2 for j in the range of the sum. The left side is part of the sum that dehnes 
the function Splitting this sum in the parts with cXj^n < 1 and with cXj^n > 1 and 

noting that p < 1, we see 


Of„(c,/)<(l+4L) 


.E 


/. 


J,n 


(I T cXj^n) 


2^ 


(l + 4L)(l + p) 
P 


,E 

3 


f- rX ■ 

(It cXj^n) 


since cXnj/(l T cXnj) > p/(l T p) for j in the range of the sum. The sum on the right side 
becomes even bigger if we let the sum range from 1 to n and is then equal to — ^c (Z)f„)'(c). 
It follows that there exists a > 0 such that 

{D^JicJ) ^ a 
DU^,f) - c- 

On integrating this from c to Kc we find that log /) — log D^^{c,f) is bounded 

above by —aloglL, and the good bias condition (2.7) follows. 

The proof for the likelihood-based function Df ^ differs only in that the power of the terms 
(1 T cXj^n)'^ in the denominator must be decreased from 2 to 1. □ 


The following lemma gives the behaviour of the three variance functions if the eigenvalues 
satisfy (1.12), (1.19) or (1.20). The lemma implies that these three functions are good variance 
functions in the sense of (2.8). 

Lemma 14. The functions D^n given in (2.3), D^n given in (2.4) and Sn given in (2.6) are 
strictly increasing on [0,oo). Furthermore, if (1.12) holds, then 

D2,n{c) - D^nic) X S^(c) X (cn)^/™, 

uniformly in c in In as n ^ oo. The same is true (with instead of n) under (1.20). 
Moreover, if (1.19) holds, then 


=(c) X 


P 

S„2 


(c) 


(cn^)^/™’(l T log(cn^)) 
(cn^)^/™(l T log(n^™/(cn^))) 


if cn^ < n™, 
if cn^ > , 
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uniformly in c in /„ 2 . 

Proof. The monotonicity of and s„ is clear. Under (1.12) the function satisfies 


cni 


n .. n .. 

, TN2 - < (cnSf ( -m , A12 ^ 

[J + cnd)^ pd [J + cno)^ 


where in the second inequality we use that a; x/(l + x) is increasing. By Lemma 43 in the 
appendix the sums are of the order for c £ In- The function s„ can be treated 

analogously. 

The derivative of is given by 



^j,n. 

1 T 




(1 + cXj^nY J (1 + cXj^n) 


E 




The monotonicity of £*2 n i® ^ consequence of the positivity of this function. The value of D^n 
at c is the integral of this derivative over the interval [0,c]. If (1-12) holds, then 





sn 


(^jm _|_ /gjiY 


ds < Jc) 


nr ^ 2 

^^2 r sn^ 

io ^ (j™ + dsnY 


ds. 


By Lemma 43 the integrands are asymptotic to a multiple of (sn^)(5sn)“^'’'^/™' = 
j/./mg- 1 +i/m uniformly in s £ \ln/n,n'^~^\, for any In ^ oo and 6 = 6 and 5 = d re¬ 
spectively. The integral of the latter function over [0, c] is equal to a multiple of (cn)^/™', 
while its integral over [0,ln/n] is of the order The integral of over [0,ln/n] is 

bounded above by a multiple of sn^ ds x Hence both remainders are of 

lower order than (cn)^/™' for c £ /„ if In is chosen equal to, for instance, log log n. 

The proof under (1.20) is the same, except that we use Lemma 45 instead of Lemma 43. 
The final assertion also follows along the same lines, but now employing Lemma 44. The 
details are deferred to Section 7.2. □ 


2.1. Coverage of the empirical Bayes credible sets 


The function / is contained in the empirical Bayes credible sets (1.7) if \\fn — /n,c„|| < 
Mrn{cn,ri). In view of (1.3) and (1.1), the square of the left side can be decomposed for any 
c as 


\\fn,c - fnf = fl^nlfn " - Xl/pEn + ef (/ - ^/Icfsn 

= Dl^{c, f) + Dl^{c) + i? 3 ,n(c, /) + i? 4 ,n(c), 

where the first two processes on the right are defined in (2.3) and (2.4) and 

ipXj^n^Zj^nfj,n 

pP (1 -|- cXj^nY 


R^,n{cJ) = -2/IS-|,(/ - S-ije; = -2^^ ■ 


i?4,n(c) = el{I - S-i)2e-; _ tr((/ - S-i)^) = ^ 


1=1 


{cXn,?{Zl^-l) 

(1 -|- cXj^n)^ 


(2.9) 


( 2 . 10 ) 


The following proposition shows that the remainder i? 3 ,n + R 4 ,n is negligible relative to the 
deterministic process Dn, for both the likelihood-based and risk-based functions. 
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Proposition 15. If (1.12), (1-19) or (1.20) holds, then for atid given in (2.10) 
and Dn = lli,n + ^ 2 ,n given in (2.3) or (2./) we have 


|^3,n(c,/)| + |-R4,n(c)| 



( 2 . 11 ) 


The proof of the proposition can be found in Section 7. 

The radius rn(c, rf) of the Bayesian credible set is the r^-quantile of the posterior distribution 
of II/n — /n,c|| given c. As the distribution of fn — fn,c does not depend on Y, the radius r„(c, rf) 
is deterministic. Since the posterior distribution of fn — fn,c is multivariate normal with mean 
zero and covariance matrix I — (see (1.3)), the square norm is equal in distribution to 
the variable 


Nn{c) = Y, 
1=1 


(^^j,nZjn 

1 T cXj^n 


( 2 . 12 ) 


where the Zj^n are independent standard normal random variables. The mean of this variable 
is by its dehnition the posterior variance s^(c), given in (2.6). The following proposition shows 
that the variables Nn degenerate to their mean as n —)• oo. 


Proposition 16. If (1.12), (1.19) or (1.20) holds, then 


sup 

CGin 


Nn{c) 

sUc) 


0 . 


(2.13) 


The proof of the proposition can be found in Section 7. 

We are ready for our main result on coverage. The result applies to discrete polished tail 
functions and under every of the three eigenvalues conditions, but we give a more general 
statement, which takes the output of the preceding propositions as its conditions. 

Theorem 17 (Coverage). Suppose the following conditions hold: 

1. the remainders Ri^n vtnd i? 2 ,n behave as in (2.5) and Rs,n ctnd R^^n behave as in (2.11), 

2. (2.13) is satisfied, 

3. D 2 ji{c) X Il2,n(c) 'Sn('i) Uniformly in c C In, 

/. the function f satisfies the good bias condition and fjn — ^uPce/n -^ 2 ,n(c)- 

Then Pf{f G Cn,r],M) 1; for both the risk-based and likelihood-based credible sets (1.7) and 
sufficiently large M. In particular, this is true If (1.12), (1-19) or (1.20) and condition 4 
above hold. 

Proof. Since A^„(c)/s^(c) —)• 1 in probability uniformly in c G I„, the quantities r^(c, rf)/s‘(^{c), 
which are the 77 -quantiles of the variables Nn{c)/Sn{c), tend to 1 as well, uniformly in c. In 
order to see this, suppose that sup(,gj-^ \Vnio, rj)/s‘^{c) —1| 7 A 0. Then there exist a subsequence 
s-nd points Ck G In such that \rni^{ck,r])/Sn^{ck) — 1| > e. We may assume that we 
either have v‘/,^{ck,rf)/s‘(n,^{ck) > 1 + e for all k or ( 0 ^, 77 )/s^^(cfc) < 1 — e for all k. In the 
latter case, we see that along this subsequence we have 


p ( Nn,{ck) rl^{ck,r]) \ 


< P 


sup 


Nn.jc) 

<(C) 



^ 0 


by (2.13). The case that rnf.{ck) > 1 -|- e can be treated similarly, where now this probability 
tends to one. In either case, this contradicts the definition of r„(c, 77 ). 
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It follows that the function / is contained in Cn,ri,M if ll/n,c„ — /n|P/Sn(cn) < M^(l+op(l)). 
By the decomposition (2.9) this is equivalent to 


^CniCn, f) + D^-niCn) + R3,n{Cn, f) + RaA^u) 


< m2(1 + op(1)). 


By assumption s‘^{cn) has the same asymptotic behaviour as both D^nicn) and D^nicn), up 
to a multiplicative constant. If / satishes the good bias condition for the risk-based proce¬ 
dure, then D 2 n{cn) ^ f) with probability tending to one by Theorem 12, whence 

D^{cn,f) ^ D 2 n{cn) ^ 'Sn(cn)- It then follows that the first two terms in the display are 
bounded above, while the remainder terms tend to zero by (2.11). 

By definition we always have D^^{c, f) < /). If / satisfies the good bias condition for 

the likelihood-based procedure, then D^^{cn,f) < T> 2 „(cn) with probability tending to one 
by Theorem 12, while D^nicn) ^ R^ni^n) by assumption. It follows that again f) < 

R 2 n{cn), and the proof is analogous to the risk-based case, where for the last two terms we 
use the fact that Di^{cn, f) x D^nicn) ^ s‘^{cn)- 

The final assertion of the theorem follows by Propositions 10, 15 and 16 and Lemma 14, 
which show that all assumptions hold under (1.12), (1.19) or (1.20) and the conditions on 
/• □ 


2.2. Contraction rates of the empirical Bayes posteriors 

We first consider the risk-based setting. If the remainder processes in (2.9) are negligible 
relative to uniformly in c G which is true under our three eigenvalue 

conditions by Proposition 15, then 

\\fn,c^-fnf=Op{D^{CnJ)). (2.14) 

For the estimator Cn the right side is by the first assertion of Theorem 12 of the order (in 
probability) 

inf D^{c,f) 
cein 

with probability tending to one. Since f) is exactly the risk of the estimator fn,c for a 

given c, these two assertions combined can be viewed as an oracle type inequality for the risk- 
based empirical Bayes plug-in posterior mean fn,cn- fbe empirical Bayes estimator manages 
to choose the best value of c for each possible /. The family of estimators fn,ci where c £ In, 
turns out be rich enough to give an optimal estimation rate for the usual regularity classes. 
Thus the estimator /n,c„ adapts to unknown regularity in the usual sense. We formalize this 
in the next theorem, together with the observation that the posterior variance also adapts 
correctly. From this we deduce that the full posterior distribution contracts adaptively. 

Write nc(-1 Yn) for the posterior distribution of /„ given c and let II^^ (• | Yn) be the same 
object, but with c replaced by Cn- 

Theorem 18 (Contraction, risk-based EB). Suppose the following conditions hold: 

1. the remainders Ri^n and R 2 ,n behave as in (2.5) and R^^n and R^^n behave as in (2.11), 

2. D 2 n{c) X Sn{c) Uniformly in c £ In. 
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Then for Cn given by (1.6) and any sequenee Mn —>■ oo, 

: INn - inll^ > Mn inf Ef\\fn,c - fnf \ Yn) ^ 0. 

V ce/n / 

In particular, this is true if (1.12), (1.19) or (1.20) holds. 

Proof. Let W denote a variable that given Yn and c is distributed according to the posterior 
distribution of /. Then by Markov’s inequality, for any M and c, 

M^U,{w:\\wn-fnf>M^\Yn) < E [|| #„ - ^ || ^ | c] 

< \\fn,c - fnf + E[||l4 - /n,cf I Yn, c] . 

The second term on the far right is the posterior variance s^(c), which by assumption is 
bounded by a multiple of D^nic) Y E)^(c,/) uniformly in c G In. The first term on the far 
right evaluated at c = c„ is bounded above by Dn{cn, f) with probability tending to one, in 
view of (2.9) and (2.5) and (2.11). It follows that with probability tending to one 

(in : \\wn - fnf > M^ \ Yf < -^D(({cn, mf D(({c, f) 


by the first assertion of Theorem 12. Since Dn{c, f) = Ef\\fn^c — fnf, the proof is complete. 

□ 

Thus the risk-based empirical Bayes method attains a rate of contraction equal to the best 
estimator in the class of estimators fn,c^ for c G In standard models this class contains a 
rate-minimax estimator. 

Example 19 (Sobolev norm). Denote by the set all functions / for which the discrete 
Sobolev norm ||/||n,o) defined in (1.11), is bounded by 1. For eigenvalues satisfying (1.12) and 
/ G S'" for a < m we have 


n ■2m f2 


oh(c./)£Eu 




(cn)!/” 


3,ri ^ 


1=1 


^ (jm _|_ ~ (cn)^ 


1=1 


E E fir 




(cn)2 

< n(cn)-2"/™. 


{cn)‘^°‘/r 


E Yf: 


2 

j,n 


j={cn)^/'^+l 


In combination with Lemma 14 we find that 

-D^ic, f) < (cn)-2"/™ + n-fcnf/-^. 
n 

The argument c = 7 ),™-/(i+ 2 «)-i equates the two terms and gives a value of the order 
n“^"/(i+^"). By Theorem 18 this is the square contraction rate of the plug-in posterior dis¬ 
tribution with the risk-based empirical Bayes estimator (1.6) relative to the scaled Euclidean 
norm || • ||„,o- 

For a > m the order of the square bias Df„(c, /) does not improve beyond the rate n{cn)~‘^ 
found for a = m and hence nor does the contraction rate. 
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Example 20 (Hyperrectangles). Denote by 0“ the set all functions / for which the discrete 
Sobolev norm \\f\\n,a,ooj defined in (1.11), is bounded by 1. For eigenvalues satisfying (1.12) 
and / € 0" we have 


.7 = 1 


< 


n 


V- 

a 


^2m—2a—l 


)2 ~ ^ ^jm + ctt) 


< 

2 


?T.(c7l)“^“/™’ 

< ?T.(cn)“^ log(c?T.) 
^n{cn)~‘^ 


if a < m, 
if a = m, 
if a > m. 


The first case follows directly by Lemma 43, the second by writing 


f \ 1 / m 

^ j2m-2a-l T"'.' j2m-2a-l 'n- 

n} -r-- - ^ = n y T. - ^ + n > 

(jm _|_ (;fi'j2 (^jm _|_ Qjiy2 


i=i 


j=(cn)^/™+l 


j2m—2oL—l 
(j™- + C7r)2 


and applying a variant of the lemma to the second sum. The third case follows immediately 
by using + cn > cn. For a < m and a > m this is the same result as in Example ??, 
leading to the same conclusions on the contraction rate. For a = m the additional logarithmic 
factor leads to the square contraction rate n“^“/^^“"’“^^(log?T,)^/(^“"’'^^. 

The likelihood-based empirical Bayes method also satisfies an oracle type inequality, but 
relative to a loss function that is not as closely linked to the L 2 -risk of the posterior mean. 
Because its “bias term” is bigger (the inequality is immediate from defi¬ 

nitions (2.3) and (2.4)), while its “variance term” has the same order of magnitude, in 
its attempt to balance bias and variance the likelihood-based empirical Bayes method may 
choose a bigger estimator Cn than the risk-based method. This may have an adverse effect on 
the contraction rate of the plug-in posterior distribution. 

Theorem 21 (Contraction, likelihood-based EB). Suppose the following conditions hold: 

1. the remainders Ri^n and ii 2 ,n behave as in (2.5) and and R^^n behave as in (2.11), 

2. D 2 n{a) ^ Sn(c) Uniformly in c C In- 

Then for Cn given by (1./) and any sequence Mn oo we have 


n, 


Cn 


(^w : 


Wn - 



0 . 


In particular, this is true if (1.12), (1.19) or (1.20) holds. 

Proof. Since we obtain as in the proof of Theorem 18 that 

\\fn,Cn-fnf=Op{D(:{CnJ)). 


Next we can use the first assertion of Theorem 12 to replace the right hand side by the 
infimum of D^(c, /) over c. The posterior variance is of the same order as D^n and hence the 
proof can be concluded as the proof of Theorem 18. □ 

Even though the loss function of the likelihood-based empirical Bayes estimator does not 
relate correctly to the risk in general, the method does give optimal contraction rates on the 
models in the preceding examples, albeit for a smaller range of regularity levels. 
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Example 22 (Sobolev norm). For eigenvalues satisfying (1.12) and f £ for a < m/2 we 
have 


(cn) 


1 /n 


n -m f‘2 -| 


“ j'm _|_ Cfi cn 


< 

rsj 


^Cn)im-2a)/r 


a)!/" 


cn 






j=l 


(cn)2a/»i 


E /“/I. 






The upper bound has the same form as for the risk-based empirical Bayes method. Since 
we obtain the same contraction rate results. The difference is that the rate does 
not improve for a > m/2. 

Example 23 (Hyperrectangles). For eigenvalues satisfying (1.12) and / G 0“ we have 


DinicJ) < 


nj 


-2a-l 


i=i 


“t" cA 


< 


n 




m—2a—1 




~ 'i'™ -hen 


< 


1=1 


n(cn) 
c“^ log(cn) 


r,-i 


if a < m/2, 
if a = m/2, 
if a > m/2, 


This leads to the contraction rate n "/(2"+i) relative to the scaled Euclidean norm || • ||„^o if 
a < m/2 and the square contraction rate (log if a = m/2. 


2.3. Diameter of the empirical Bayes credible sets 

The empirical Bayes credible sets inherit their diameter from the contraction rate. 

Corollary 24. Under the eonditions of Theorems 18 and 21 the square of the diameter 
Mrn{cn,ri) of the credible sets (1-7) is of the order f) and infDll{c, f) for 

the risk-based and likelihood-based empirieal Bayes proeedures respectively with probability 
tending to one. 

Proof. By Theorems 18 and 21 the empirical Bayes posterior distributions concentrate all their 
mass on a ball of radius of the same order as the given rate. Since the posterior distribution is 
Gaussian, the balls Bn of the same radius centered at the posterior mean must also have mass 
tending to one. By definition the credible sets are balls of posterior mass p G (0,1) around 
the posterior mean, and hence are contained in the Bn. 

Alternatively, the square radius r/^^Cn,^) was seen to be of the same order as the posterior 
variance s‘^{cn), which was in turn seen to have the given order. □ 


3. Hierarchical Bayes 

The hierarchical Bayes method is closely related to the likelihood-based empirical Bayes 
method, since the posterior density of c is proportional to the product of the the prior density 
vr for c and the marginal likelihood that defines the latter method. More precisely, in the 
model (1.2) augmented with c ~ vr it holds that 

TTnic I Yn) OC p{Yn \ c) 7r(c) OC det '^n.cYn 
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The likelihood-based empirical Bayes estimator (1.4) would be the posterior mode if the prior 
density were improper. We shall analyse the hierarchical Bayes method by exploiting this 
link. 

We start with showing that the posterior distribution of c concentrates on the interval 
where the deterministic part of the likelihood-based criterion „ -|- is small. This 

criterion is derived from minus the log marginal likelihood. On closer inspection it becomes 
evident that the prior density vr, which we will choose inverse gamma, also plays a role and 
adds a term 1/c to this criterion. We truncate the inverse gamma prior to the interval In, so 
that c has a prior density so that, for some fixed At, A > 0, 

7r(c) oc , c € In- 


Theorem 25. Suppose the following conditions hold: 

1. the remainders Rin ^ 2 n satisfy (2.5), 

2. the function D^n is a good variance function with „(c) > log (nc), 

3. there is a minimizer Cn{f) of ce^ Dl^{c, f) + 2\/c over c G (0, oo) that satisfies Cn{f) G 
In ond 2Cfi(/) G In- 

Then for sufficiently large M 


n. 


Dnic,f) + - < Minf 
c oo 


^n(c,/) + - 



Furthermore, if f satisfies the good bias condition relative to then 

n, (c : Df^^ic, f) + l< Dl^{c) I Yn) ^ 1. 
Moreover, there exist constants 0 < k < K < oo such that 


Un{c:ce [kCn {f),KCn{f)]\Yn)^l. 

In particular, these assertions are true if (1.12), (1-19) or (1.20) holds, for every f satisfying 
condition 3. 


Proof. For every measurable set J Y In, 


Iln, ( c ; c G T I Yn 


fjC 7r(c) dc 

7r(c) dc 




by the decomposition (2.2). Define £n{c, f) = D(^{c,f) + 2A/c, so that Cn ■= Cn{f) is a 
minimizer of in- In view of (2.5) we have, for any d > 0, 


2A 

ln{c, m -^)< D(){c, f) + R(i{c, /) + — < 4(c, /)(1 + <5), 


with probability tending to one. Consequently, 


n„ c : c G JI Tn < 


r g^-2^n(c,f}(l + 5) ^-K-l 

'Jin 
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with probability tending to one. Since is a good variance function, we have that 

D^^{2cn) < {B’)~^2^D2^n{cn)- Because is decreasing and is increasing, we then 
also have that in{c, f) < {B')~^2^£n{cn, f) for every c G [cn,2c„]. Combining this with the 
fact that inic, f) > 2A/c, it follows that 


nJc:inicJ)>M£niCn,f)\Yn) < 




e ■ 2 B‘ 


-^2I>(1+S)e„icnj) r2cu 

J Cj 


' Cn 


dc 


roo 

-/)/ e-^ 

Jo 


l(l-5)A/c 


for M(1 — 6) > (4k + 2(B')~^2^)(1 + 6). If c„ —0, then this clearly tends to zero. If Cn is 
bounded away from zero, the above also tends to zero, by the assumption that £n(c, f) > 
log(cn). This concludes the proof of the first assertion of the theorem. 

If / satisfies the good bias condition, then, for ii' > 1, 


D^(Kc,f) + 


2A 

ITc 


< R-^D 


2A 


l,nl 




2A 


c J 


In other words, the function c Df^(c, f) + 2A/c also satisfies a good bias condition. 

Let An = {c : £n{c, f) < Min{cn, /)}, for Cn the solution to the equation D^^(c, /)+2A/c = 
L >2 „(c). Since in{cn, f) < ^n(cn, f), we have that n„(c : c G A„ | Yn) —>• 1 by the first part of 
the proof. Since in is the sum of the decreasing function Di^(c, /) + 2A/c and the increasing 
function D^n, which are both “good” functions, it follows that Di^(c,f) + 2A/c < D^nic) 
for every c G A„ by Lemma 42 (i). Furthermore, Lemma 42 (ii) gives the existence of constants 
0 < fci < iLi < oo with An C [kiCn,KiCn]- Since Cn G A„, it follows that also A„ C 
[ki/KiCn, Ki/kiCn]- This proves the second and third assertions of the theorem. □ 

The theorem shows that under the posterior distribution the scaling c will concentrate on 
the set of small values of the criterion c e-)■ D(^^(c, f) + 1/c. This differs by the term 1/c from 
the criterion minimized by likelihood-based empirical Bayes estimator Cn defined by (1.4), 
whose behaviour is given in Theorem 12. The additional term is due to the prior distribution. 
The usual prior distribution, which we consider here, has very thin tails near 0, and the extra 
term 1/c essentially prevents the posterior distribution to concentrate very close to zero. 

Very small values of the scaling parameter c are advantageous for very smooth functions /. 
For such functions the bias term f) will be very small and the balance between square 

bias DJ/^(c,f) and variance D^nic) will be assumed for small c. The additional term can 
be viewed as adding an artificial bias term of the order 1/c, thus shifting the bias-variance 
trade-off to bigger values of c. 

In most cases this is not harmful. In particular, the shift will not be apparent in contraction 
rates over the usual smoothness models (see Example 29). The following example shows that 
this is different for very smooth /. 

Example 26. The smoothest imaginable function / is the zero function. For / = 0, the 
bias function iAf^(c,/) in (2.4) vanishes. If the eigenvalues satisfy (1.12), then the variance 
D 2 n(c) is of the order (cn)^/™ by Lemma 14 and the criterion becomes 


c^I)^(c,/) + -x(cn)fo- + -. 

c c 
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The right side is minimized by Cn x . Theorem 25 shows that the posterior 

distribution for the scale parameter c will concentrate on the set of c that minimize the 
criterion up to a multiplicative factor. This set is contained in an interval with boundaries of 
the order . 

The fact that this interval shrinks to zero is good, as the variance is smaller for smaller 
c, while the bias is negligible. However, it is a bit disappointing that the shrinkage is not 
faster than of order In comparison, the empirical Bayes estimator Cn will shrink 

at the order logn/n, the minimal possible value permitted in our minimization scheme by 
Theorem 12. 


3.1. Coverage of the hierarchical Bayes credible set 


The hierarchical Bayesian credible sets cover true parameters under the same conditions as 
the empirical Bayes sets. 

Theorem 27 (Coverage, HB). Suppose the following conditions hold: 

1. the remainders Ri^ ^ 2 n behave as in (2.5) and o^nd behave as in (2.11), 

2. (2.13) is satisfied, 

3- is a good variance function with > log(nc), 

/. there is a minimizer Cn{f) ofc<-^ D(^{c, f) + 2X/c over c € (0, oo) that satisfies c„(/) £ 
In and 2cn{f) £ In; 

5- D§n{c) X T> 2 ,n(c) - 4(c) Uniformly in c C R, 

6. the function f satisfies the good bias condition. 

Then the hierarchical Bayes credible sets (1.8) satisfy Pf{f £ Cn,ri,M) —^ 1 for sufficiently 
large M. In particular, this is true if (1.12), (1.19) or (1.20) holds and conditions 4 ond 6 
hold. 


Proof. The function / is contained in Cn,ri,M as soon as there exists some c £ [ci^„(r/i), C 2 ,n(^i)] 
with Wfn — fn,c\\ < Mrn{c,r] 2 ). Since Nn{c)/Sn{c) —)• 1 in probability uniformly in c £ R hy 
(2.13), the quantities 4(c, ? 72 )/s^(c), which are the ? 72 -quantiles of the variables Nn{c)/Snic), 
tend to 1 as well uniformly in c. In view of the decomposition (2.9) it follows that the function 
/ is contained in Cn,r],M as soon as there exists some c £ [ci,n{'ni),C 2 ^n{'ni)] with 


/) + D^^^ic) + Rs,n{c, f) + i?4,n(c) 

4(c) 


< m2(1 + op(1)). 


By assumption s^(c) is equivalent to both D^nic) and D^nic), up to a multiplicative constant. 
In particular, the second term on the left is bounded above. 

By the second assertion of Theorem 25 the posterior probability of the set := |c ; 
Din{c,f) ^ ^ 2 n(c)} tends to one in probability. Since ci^nioi) and R^nioi) are nontrivial 
quantiles of the posterior distribution of c, the interval [ci,n(^i)) C 2 ,n(f?i)] must intersect A„ 
with probability tending to 1. For c = in this intersection it holds that Dn{c, f) x D^nic) 
and hence s^(c) in the preceding display can be replaced by Dn{c, /), up to a multiplicative 
constant. This shows that the remainder terms tend to zero, in view of (2.11). The first term 
-C’pn(c,/)/4(c) is bounded by a multiple of f)/D(^{c, f) < T)(^„(c,/)/£)(;„(c,/) < 1, 

by definitions (2.3) and (2.4). This proves the first assertion of the theorem. 

The final assertion of the theorem follows by Propositions 10, 15 and 16 and Lemma 14, 
which show that all remaining assumptions hold under (1.12), (1.19) or (1.20). □ 
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3.2. Contraction rate of of the hierarchical Bayes posterior 


As in Section 2.2 write nc(-1 Y^) for the posterior distribution of fn given c. Then the hier¬ 
archical posterior distribution can be decomposed as 


n, 


i{w: Wn e B\Yn) = J Ilc{w: Wn e B \ Yn) TTnic | Yn) dc 


for B C M"- measurable. Here '7r„(c | Yn) is the posterior density of c, analysed in Theorem 25. 

This hierarchical posterior distribution contracts to the true parameter according to an 
oracle inequality, with the likelihood-based criterion augmented by the extra term 1/c. 

Theorem 28 (Contraction rate, HB). If conditions 1, 3, /, and 5 of Theorem 21 hold, then, 
for any sequence Mn —)■ oo, 


n 


n 



Wr. 


fnf > Mn 


inf 

CGin 


ll 


Dn (C, /) + - 



Proof. Let Cn G In be a minimizer of c e-)- Dn{c, f) + 1/c and for given Mi define a set 

Cn = {ccln: D/:{c, f) + 1/c < Ml [D/i{Cn, f) + 1/Cn]]. (3.1) 

By Theorem 25 the posterior probability that c € Cn tends to 1 in probability, for sufficiently 
large Mi. Therefore, for any M > 0 we apply the above decomposition of the posterior to 
find 


nn(ic : \\wn - fn\\> M\ Yn) < sup nc(rr : \\tVn - .^11 > M I Yn) + n„(c : C^ Cn\Yn) 

CGCn 

- - fnf + 4(c)] + Op(l) 


by Markov’s inequality. In view of (2.9), this is further bounded above by 
1 


T79 sup 


f) + -^Xn(c) + R‘i,n{c, f) + R4,n{c) + s'nic) -|- Op(l). 


-iP 


Here D^n < D^ni und D 2 n is of the same order as D^n und s^. It follows that the first two 
terms are bounded by a multiple of Dn{c, f) < MiDn{cn) + l/c„. The remainder 

terms are of the order Dn{c, f) uniformly in c G In with probability tending to one by (2.11) 
and hence are similarly bounded. □ 

Example 29 (Sobolev). It was seen in Example 22 that for eigenvalues satisfying (1.12) and 
/ G S'" for a < m/2 we have 

Dlnic, f) + Dlnic) < n(cn)-2"/- + (cn)V-. 

The upper bound on the right side has minimum value at Cn x ^ 

point the term 1/cn is smaller than (for a < m/2). It follows from Theorem 28 that 

on the model S" the hierarchical Bayes posterior distribution contracts at the same rate as 
the likelihood-based empirical Bayes method. 
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Example 30 (Hyperrectangle). It was seen in Example 23 that, for eigenvalues satisfying 
(1.12) and / G 0“, 




< 


n(cn)“^“/™' + (cn)^/"* 
c~^ log(cn) + 
c~^ + (cn)^/™' 


if a < m/2, 
if a = m/2, 
if a > m/2, 


It follows again that the hierarchical Bayes posterior distribution contracts at the same rate 
as the likelihood-based empirical Bayes method. 

Example 31 (Zero function). The square bias of the function / = 0 is equal to zero. 
For eigenvalues satisfying (1.12) the minimum of c i—)• D/^{c, f) + 1/c is assumed at Cn x 
resulting in a rate of contraction for the scaled Euclidean norm || • ||n,o of the 

order n 

In contrast the empirical Bayes estimators attain a rate of contraction of the order 
up to a logarithmic factor. 

The same difference between the hierarchical and empirical Bayes methods exists for (se¬ 
quences of) functions / with a square bias Di^{c,f) that tends to zero at an exponential 
rate. 


3.3. Diameter of the hierarchical Bayes credible set 


The diameter of the credible sets is again of the same order as the contraction rate. 

Theorem 32. Under the conditions of Theorem 28 the diameter of the credible sets (1-8) is 
of the order infcg/^ [D/^{c, /) -|- 1/c] with probability tending to one. 

Proof. In view of Proposition 16, for fixed c the radius of the credible set {re : \\wn — fn,c\\ < 
Mrn{c,rj 2 )} is of the order the posterior standard deviation Sn(c) given by (2.6). Thus the 
triangle inequality gives that the diameter of Cn,ri,M is bounded above by a multiple of 

sup [Sn{c) + \\fn- fn,c\\]- 

Cl,n{m)<C<C2,n{m) 


The supremum of the function in this display over the set Cn defined in (3.1) is shown to 
be of the desired order in the proof of Theorem 28. The theorem would follow if the interval 
[ci,n(^i)) C 2 ,n(??i)] belongs to Cn with probability tending to one. 

By Theorem 25 the posterior distribution of c concentrates all its mass on the sets Cn. 
Since ci,n(^i) and C 2 ,n(??i) are nontrivial quantiles of this distribution, we can conclude that 
they must belong to the convex hull of Cn with probability tending to one. If this convex hull 
is [cm, Cm], then for any c in this convex hull 


-^n(C)/) + - — Df„^{c,f)-\ |-T>2,n(c) < -Of,n(Cm!/)H l“-^2,n(cM) < 2Mi 

C C Cjji 


Dn{CnJ) + — 

Cn. 


Thus the convex hull of Cn is contained in a set of the same form as Cn, but with the constant 
Ml replaced by 2Mi. The proof of Theorem 28 still shows that the supremum over this bigger 
set is of the desired order. □ 
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4. On the polished tail condition 

The parameter in the regression model (1.1) is a fixed function /, but most of the results of 
this paper are driven by the representation of the restriction /„ of / to the design points in 
terms of the eigenvectors Cj^n of the covariance matrix [/„ of the (unsealed) prior restricted 
to the design points. It is clearly of interest to relate the “continuous” object / to its discrete 
counterparts, but this is more involved than it may seem. 

In this section we investigate the relationship between the continuous and discrete setups 
for the special case of the Brownian motion prior. 

4.1. Aliasing 

For the design points Xi^n = i/n+i where n+ = n + 1/2, the eigenvectors of the covariance 
matrix Un of discretized Brownian motion are given in (1.16) for j £ {1,..., n}. The formula 
shows that they are l/v^n+ times the restrictions of the eigenfunctions ej to the design points. 
Using this correspondence we may also define vectors ej^n £ 1^"' for j > n, again by (1.16), 
by discretizing the higher frequency eigenfunctions of Brownian motion. Since the vectors 
ei^n, ■ ■ ■ ,^n,n are an orthonormal basis of M"', these further vectors are redundant. It turns 
out that their linear dependency on the vectors ei^n for i < n takes a very special form: 

(i) The vectors are (2n + l)-periodic in i: ej+2n+i,n = ^i,n for all i. 

(ii) The vectors in the middle of a (2n + 1) period vanish: en+i,n = 0. 

(hi) The vectors within a (2n+l) period are anti-symmetric about the midpoint: e 2 n+ 2 -i,n = 
—ei^n for all i. 

In particular, every ej^n with j > n is either zero or “loads” on exactly one Ci^n with i G 
{1,... ,n} with coefficient 1 or -1. This leads to a simple connection between the inhnite 
expansion of a function / = fj^j fo the eigenfunctions Cj of continuous Brownian motion 

and the finite expansion /„ = fi,n^i,n of the discretized function /„ in the eigenvectors 
ej^n of discretized Brownian motion, as follows. Assuming that the series f{x) = fj^ji^) 

converges pointwise, we can use (1.16), which says that {ej)n = and (i)-(iii) to see 

that the coefficients in /„ are given by 

CO 00 

fi,n = ^ fji^j)n^i,n = \/^+ ^ ^ (/(2n+l)Z+-i “ /(2n+l)Z+2n+2— i)• ('^•1) 

j=0 1=0 

The terms of this last series correspond to the consecutive periods of lengths (2n -|- 1). Ex¬ 
actly two of the inner products per period are nonzero and they yield coefficients 1 and — 1 
respectively. The formula is an example of the aliasing effect in signal analysis: the energy of 
the function / at frequencies j higher than the Nyquist frequency n, whose fluctuations fall 
between the grid points, is represented at the lower frequencies. 

The scaling ^/n+ results from the normalisation of the vectors ei^n in 1^”. However, even 
apart from the normalisation the correspondence between the discrete and continuous coeffi¬ 
cients is imperfect. By writing (4.1) in the form 

oo 

fi — f2n+2-i + '^{f (2n+l)l+i “ f {2n+l)l+2n+2-i) ^ 
l=l 
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we see that is in general not equal to /*. The “harmonic frequencies” at periods 

2n + 1 add to a frequency at i € {1,2,... ,n}, and the frequencies mirrored around the 
midpoints of the blocks subtract from it. 

It is clear from the preceding display that a given discrete sequence (/i,n) can be obtained 
from the inhnite sequence (/i,n) f 2 ,n, ■ ■ ■, fn,n, 0,0,.. .)/y^nIjr of coefficients, but also from 
many other infinite sequences (fj). Because the data model (1.1) depends on / only through 
the discrete sequence there is clearly no hope to recover which of these infinite sequences 

would be the “true” sequence. Furthermore, for a given fixed infinite sequence the values of 
the array (/i,n) will change with n, and for some reasonable infinite sequences the series 
defining the discrete coefficients may not even converge. (We obtained the preceding display 
under the assumption that the series fjej{x) converges pointwise.) The following lemma 
shows that the infinite series is essentially a Fourier series, and hence this less than perfect 
correspondence is disappointing. 

Lemma 33. For a given / : [0, 1] —>• R in ^ 2 ( 0 , 1], the expansion f = fjOj is derived from 
the Fourier series of the function x on [0,2], where f is extended to [0,2] by 

symmetry about 1. In particular, if f £ (^“[0,1] for some a > 0 and /(O) = 0, then 

00 

fix) = Y, fjej{x), uniformly in x. 

1=1 

Proof. The function x e®™/^/(x), with / extended as indicated, is periodic (i.e. it has the 
same value at 0 and 2) and contained in L2[0, 2]. Its Fourier series can be written in the form 

jez 


for some Cj G C and hence 

/(x) = 

j& 

Since / is real, the complex part of the right side vanishes, while the real part can be written 
in the form 

fix) = ^ajCos(7rx(j - 1/2)) - 6jsin((j - l/2)7r3:), 

jez 

for aj,bj G R. Since / is symmetric about 1, the antisymmetric cosine part vanishes, while 
the terms with j < 0 of the sine part can be united with terms with j > 1. This gives 
an expansion in terms of the eigenfunctions Cj. By the orthogonality of these functions the 
resulting expansion is unique. 

If / G (^“[0,1], then the extended function x 1 -^ e*^^/^/(x) is contained in (^"[O, 2] and hence 
we uniform convergence in (4.2). The uniform convergence is retained under multiplying left 
and right with □ 

As a consequence of the lemma, the speed at which the fj tend to zero as j —)• 00 can be 
interpreted in the sense of Sobolev smoothness. However, this is not easily comparable to the 
smoothness of the corresponding array (/i,n)- In fact, if / is contained in a Sobolev space of 
order a for a <\j2, that is j‘^°‘f'j < 00 , then the aliased coefficients may not even be well 
dehned. 
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4-2. Polished tail sequences 

In [18] a function /, or rather its infinite series of coefficients {fj) relative to a given eigenbasis, 
is defined to be polished tail if for some L, p > 0 and all sufficiently large m, 


pm 




(4.3) 


3=m 


j=m 


This reduces to the “discrete polished tail” condition (1.10) if applied to the infinite sequences 
(/i,n, f 2 ,n, ■ ■ ■, fn,n, 0,0,...)/For general sequences {fj) the relationship is less perfect, 
but for typical examples the two concepts agree. 

Example 34 (Self-similar sequences). In [18] an infinite sequence {fj) is defined to be self¬ 
similar of order a > 0 if for some positive constants M, p, L and every m, 

pm 

sup < M, and ^ f] > 

■1 — ^ j=m 

Particular examples are the sequences with the exact order \fj\ x Self-similar se¬ 

quences are easily seen to be polished tail for every a > 0 and arbitrary p > 1. For a < 1/2 
the corresponding function is not necessarily well defined at every point and the series (4.1) 
defining the aliased coefficients may diverge. However, for a > 1/2 the induced array (/i,n) is 
well defined and also discrete polished tail in the sense of (1.10). 

To see this, first note that for l>\ and taking M equal to 1 for simplicity we have 

\f{2n-el)£+i \ ^ I /(2n-|-l)£-|-2n+2— i 1 ^ j^1/2+o£1/2-|-q: ' 

This shows that the series (4.1) that defines the aliased coefficients converges. Furthermore, 
we see that the rescaled coefficients fi^n = fi,n/-s/nif satisfy — fi\ < so that 

l/i,n| ^ -I- and the left side of (1.10) satisfies 


/ ^ ^ 77^20 ' j / 2 a ^ 77720 

i=m 


We wish to show that the right side of (1.10) is lower bounded by the expression on the right, 
where we may assume that m satisfies pm < n, because otherwise there is nothing to prove. 
First we note that 


\fi,n fi I ~ \fi,n fi\ \fi,n + /i] ^ l/2-l-o ' 






1 


77,1/2+“ 


It follows that, for some universal constant C, 


pm 


pmAn 


2 C{p-l)m 


n 


i-e2a 


pm I I 

cE 


> 






L- 


2C(p-l) 

nl/2+o 


For sufficiently large L the constant in the last display is positive. 
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Example 35. The sequence fj = is easily seen to be polished tail for every a > 0, as 

is also noted in Example 34. We shall show that the corresponding array (fi^n) is also discrete 
polished tail in the sense of (1.10), for any a > 0, thus extending Example 34 to the range 
a € (0,1/2]. This refinement is possible by the exact form of the fj, which allows us to exploit 
cancellation of positive and negative terms in (4.1). 

To prove the claim we first apply the mean value theorem to find that, for every i>l, 

\f{2n+l)l+i f{2n+l)£+2n+2-i\ ^ j^l/2+a£3/2+o ' 

This shows that the series in (4.1) dehning the discrete coefficients converges. Moreover, 

2 1 \ 

~ ^l/2+a ^ ^ l/(2n+l)£+i /(2n+l)£+2n+2—i I ^ ^ 1 / 2 + 0 ? j^l/2+a ' 

Consequently, the left side of (1.10) satisfies 


1 

Y P 

/ ^ ■'i-tri ^ 77 i 2 q : 

i=m 


+ 


n 


— < 
2a ~ 


1 

np/a 


Furthermore, since all terms in (4.1) are positive, we also have 


f. >_> _ 

Jhn _ .i/ 2 +„ (2n + 2 - i)V2+a ~ il/ 2 +a ’ 


for i < cn and any fixed c < 1. To bound the right side of (1.10) we may assume that m 
satisfies pm < n, because otherwise there is nothing to prove. Then choosing c < 1 and p > 1 
such that cp > 1, we have 


pmAn cpm cpm 

Y P >Y P >Y —— > / 

Ji,n ~ ,'1+2q! — / 

•__ ^ Jrr 


pcpm 1 -I 

dt > 


fl+2a ~ jp2a 


The right side is seen to be bigger than a multiple of the left side of (1.10). This proves the 
claim. 


4.3. Prior polished tail sequences 


According to the Bayesian model the true function / is a realisation of the prior process W^. 
In this section we show that almost every such realisation gives rise to a discrete polished tail 
array. Consequently, for a Bayesian who believes in her prior, the polished tail condition is 
reasonable. For a non-Bayesian the following proposition is also of interest, as it shows that 
polished tail functions are abundant. 

The proof of the statement will be based on the Karhunen-Loeve expansion. For standard 
Brownian motion = iWf : t £ [0,1]) this is given by 


IF/ = V 




2 ^ (j - l/2)vr ^ 


it). 
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Here Zi, Z 2 ,... are independent standard normal random variables. We see that the prior 
is given by fjej, for the infinite sequence fj = -y/cZj/((j — l/2)7r). We shall show that the 
induced array fj^n defined by (4.1) is discrete polished tail, almost surely. 

In fact a more general result holds for any Gaussian series with polynomially decaying 
singular values relative to the eigenbasis of Brownian motion. 

Proposition 36. For given a > 0 and <5 G M set 

00 ^ 

= E ^ [O’ 1]’ 

where ^ 1 ,^ 2 ,... are independent standard normal random variables. Then almost every real¬ 
isation of W is both polished tail in the sense of (4-3) and discrete polished tail in the sense 
of (1.10). 

Proof. The first claim is proved in Proposition 3.5 of [18]. To prove that W is discrete polished 
tail, we consider the coefficients given in (4.1): 

/ ■^(2n+l)i+i ■^(2n+l)Z+2n+2—i 

^ \{6 + {2n + 1)1 + iy/^+^ ~ {S + (2n + 1)/ + 2n + 2 - i)i/ 2 +a 

In view of Levy’s continuity theorem this array consists for each n of independent zero-mean 
normal random variables Wi^n-, W 2 ,n) ■ ■ ■, Wn,n with variances 

/ 1 1 
var(Wi,„) X g (((2n + l)Z + i)2«+i + ((2n + 1)Z + 2n + 2 - z)2«+i 

Now let L,p> 0 and consider the event = {Yl'i=m ^ Yli=m Setting 

pra n pm n 

x = LY,wi^-Y,^ln = {L-i)Y,wi^- E 

i=m i=m i=m i=pm-\-l 




we see that Em has probability P{Em) = P{X < 0). We then have by Markov’s inequality 
that for r] > 0 


P{Em) 


EIX - EXI*? 

P{X<11)<P{\X-EX\> EX) < ' ' 


We proceed to bound the expectation of X. Clearly the right hand side of (4.4) is bigger than 
Since i < n, it is also smaller than 

1 3 „ 1 

z2«+i ^ (2n + l)2“+i ^ ((2n + l)x + z)2“+i 

- (2n + l)2“+i ((2n + l)x + i)2“+i 

for some Li > 0. It follows that 

pm ^ n ^ ”1 

EX > (L - 1) ^ fpX+i ~ E fpX+i ~ E f/IE+i 

i=m 2=pm+l 2=pm+l 
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We choose L and p large enough so that this is positive. Applying the Marcinkiewicz-Zygmund 
inequality and next Holder’s inequality with conjugate parameters (r//2, p/{p — 2)), we obtain 
for p > 2: 


pm 


ri/2 


E\X-EX\^ E 

i=pm-\-l 

/ n \ 


2/r? 


\ \i=m / \i=m 

n / n 

= E E| E 

i=m \i=m / 

Since E\Wf^ — EWf^\^ x var(lW^n)'^ < we conclude 


-p/{r]-2) 


r,/2-l 


E\X-EX\^ <^i 


(l/2-(l+2a))r, 


E' 


V/2-l 


-r)/(? 7 - 2 ) 


< ml-(V2+2a)77+77/2-l-77/2 ^ ^-{l/2+2a)v 


m 


hence the P{Em) are bounded by a multiple of and thus summable over m for p > 2. 

It follows by the Borel-Cantelli lemma that the event Em occurs at most finitely many times, 
with probability one. □ 


5. Discussion 

The model (1.1) can also be formulated directly in terms of the coordinates (/i,n) of fn relative 
to the eigenbasis ej^n of the prior covariance matrix Un- For On the orthogonal matrix with 
rows the eigenvectors ej^n of Un, the definition of fj^n gives 


OnXn — Onfn T On^n — 


(h,n\ 

f2,n 

\fn,n/ 


T On^^r, 


By the orthonormality of On the error vector On^n is equal in distribution to Sn, whence 
Yn = OnYn Can be considered a vector of observations in a normal mean model with mean 
vector (fi^n)- Under the prior on /, given c the vector {fi^n, ■ ■ ■, fn,n)'^ = O/^fn possesses 
a mean zero normal distribution with covariance matrix cO/^UnOn = diag(cAj^n)- Prior and 
data model both factorise over the coordinates, and it can be seen that under the posterior 
distribution given c the variables fi^n, ■ ■ ■, fn,n are again independent with 


Yn,C^M 


cXi 


1 + cAj 


.y. 

^ i,n, 


cXi^n \ 

1 + cXi^n J 


This gives a representation of the posterior distribution different from, but equivalent to (1.3). 
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In this form the model resembles the infinite Gaussian sequence model (or white noise 
model). A difference is that presently the sequence is of length n instead of infinite, and the 
parameter vector ■ ■ ■, fn,.n) changes with n, even it refers to a single true function /. 

The discussion in Section 4.1 shows that this difference is not trivial. 

Likelihood-based empirical Bayes and hierarchical Bayes estimation of the scale parameter 
c in the infinite sequence model were studied in [17]. Besides considering the finite sequence 
model, in the present paper we also study the risk-based empirical Bayes method and allow 
more general priors. A main difference is that we have focused on the coverage of the credible 
sets. Such coverage is also studied in [18], but only for the likelihood-based empirical Bayes 
method in the infinite-sequence model with AA(0, i“^“^")-priors and a taken equal to the 
smoothing parameter. The focus in the present paper on balls in the space of the finite vectors 
fn of function values allows us to make the connection to the correctness of a fraction of the 
credible intervals, as in Corollary 4. The present paper also differs in its technical details and 
proofs, in that our results are directly formulated in terms of the criterion that is optimized, 
whereas [18, 17] make the derivative of the criterion intercede. The present approach gives 
better insight and allows to state the contribution of the (discrete) polished tail condition 
more precisely, with the possibility of generalisation to the good bias condition (2.7), which 
is dependent both on the method and the prior. 

Throughout, we limit the estimator to the interval In- This is reasonable, since the optimal 
rate of rescaling for functions in a class of smoothness a satisfies cn^ n^, where 5 = m/{I + 
2a) € (0, m] (if a € (0,m) or a S (0,m/2) in the risk-based and likelihood-based methods). 

We consider the hierarchical Bayes only with the usual inverse Gamma prior on the scaling 
parameter. From the proof it is not difficult to see that the result extends to more general 
priors. For instance if ~ r(K, A), for some r > 0, then the theorem is again true, but with 
the term 1/c replaced by (1/c)^. A choice r < 1 does not change much, but the choice r > 1 
has an adverse effect on the rate of contraction for Sobolev classes: optimality is obtained 
only for a < (l/r-|-m — l)/2. 

The assumption that the errors in the regression model are normally distributed is crucial 
to define the posterior distribution and credible sets. However, the derivation of the properties 
of these objects uses only that the errors have mean zero and finite fourth moments. Thus 
the standard normal model may be misspecified. This is true in particular regarding the 
assumption of unit variance, although it would be preferable to extend our results to allow 
for a prior on this variance. 

The study of credible bands, rather than credible balls or credible intervals in a fractional 
sense, would require control of the bias of the posterior mean in a uniform sense. This involves 
properties of the eigenvectors of the priors and goes beyond the ‘T 2 -theory” considered in the 
present paper. The bias in the example of Brownian motion is considered in detail in [16]. We 
hope to employ this in the study of credible bands in future work. 
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7. Technical proofs 

In this section we give the proofs of Corollary 4 and Propositions 10, 15 and 16. 


7.1. Proof of Corollary f 


In the Bayesian model (1.2) we have Yn = for independent vectors and e„. The 

marginal posterior distribution of f{x) given c and Yn is the conditional law of given c and 
Yn- By the assumed Gaussianity, this is a normal law with mean the conditional expectation 
fn,c{x) = I Yn, c) and variance equal to 

sl{c, x) = var [W^ | c, Yn] = var - E(BT | c) | c] = inf E [{W^ - a^Ynf \ c]. 


When evaluated at a design point x = Xi^n, this is equal to the diagonal element of the 
posterior covariance matrix I — Hence the sum of the posterior variances over the design 

points is the trace of this matrix. It follows that for all i £ Jn we have 

slic, Xi^n) > - tr(/ - 

n ’ n 

where s^(c) is given in (2.6). It follows that for i £ Jn the radius Mrn{c, Xi^n) of the em¬ 
pirical Bayes interval Cn,ri,M{xi,n) is bounded from below (up to a universal multiple) of 

MZr^Sn{c)/y/n. 

The function / fails to belong to the empirical Bayes interval Cn,r),M{x) if and only if 
\f{x) — fn,cni^)\ ^ Mrnicn,T],x). Therefore, by Markov’s inequality 


-E 


n 


f{/ ^ Cn,r],M {Xi^n')} E 


IfiXi^n) fn,Cni^i,n)\ ^ \\fn fn,^ 


iGJn 


n 


iCiJn 


M‘^rl{cn,rj,Xi^, 


M'^zlslicn) 


As noted in the first paragraph of the proof of Theorem 17, s^(cn) is asymptotic to the square 
radius r^(cn,??0 of the credible balls of the form (1.7), for any rj' £ (0,1). Therefore, if the 
left-hand is bigger than 1 — 7 , then / ^ Cn,M',v ^ multiple oi Mzjj. By Theorem 3 this 

is the case with probability tending to zero if M' is sufficiently large, which it is if M is large. 
The result then follows, since 

~ l{/ £ Cn,rj,M{Xi^n)^ + ~ ^ ^ l{/ ^ Cn,r],M{Xi^n)^ = ~~ 

iGJn iGJn 


If the function / fails to belong to the hierarchical interval Cn,ri,M{x), then |/(x) — 
fn,cn{x)\ > Mrn{cn,'n 2 ,x), for c„ as defined in the proof of Theorem 27. The rest of the 
proof is similar to the proof of the empirical Bayes intervals. 

The assertions concerning the radii are immediate from the corresponding assertions of 
Theorem 3 and the equivalences Sn{c,Xi^n) ^ Sn{c)/y/n x rn{c,ri)/y/ri uniformly for i £ Jn 
under the extra assumption on the posterior variances. 
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7.2. Proof of final assertion of Lemma If 


That and behave as claimed is immediate from Lemma 44; we only need consider 

the behaviour of 2 - The derivative of this function is given by c c~^D^ 2 (c) and hence 
is asymptotic to c~^{cn^Y^^kn{c) uniformly on the interval for any —)• 00 . 

Here kn{c) = 1 + log(cn^) for cn^ < n™ and kn{c) = 1 + log(n^"*/(cn^)) for cn^ > n"*. Now, 
as cn^ > In ^ 00 , we have for cn^ < 


since 


pc pcn^ 

/ s~^{sn^)^^^kn{s) ds = / + logtt) du x log(cn^), 

Jo Jo 

fg log udu = log t — rnJfi!’^. Furthermore, for cnJ G [n™, n^"*] we have 

pc pcn^ 

/ s~^{sn^)^^^kn{s) ds nlogn + / (l + log— logn) dri 

Jo Jn'^ 

2 rcn^ 

= nlogn + m(l + log(n^'"/ri))u^/”^|^^ + m / du 

"■ in™ 

X (cn2)^/™(l + log(n2”^/cn2)). 


Combining the two displays we see that in both cases the left side is asymptotic to 
(cn^)^/”*A;„(c). This order does not change if we limit the integrals to the interval 
[ln/n‘^,c], for /„ —)• 00 slowly. It follows that D^_^ 2 {c) has this order, provided the integral 

j-L/n (^dL(^ g'^ fj^g ig Qf lower order. Since ^ Z]r=i latter 

integral is bounded by a multiple of /^, which is of lower order again if ^ 00 sufficiently 
slowly. 


7. 3. Proof of Proposition 10 


The proof is based on two lemmas. 

Lemma 37. For the functions in both (2.3) and (2.f) and any c and s < t in (0, 00 ) we have 

var[i?p„(c,/)] <T»i,„(c,/), 
var[i22,n(c)] < D2,n{c), 

Yai[Ri^n{s, f) - f)\ < - 2 -’ 

Td I \ D l+ll ^ (^ “ '*) ^2,n{s) 

Var[i?2,n('S) - R2,n{t)\ < --• 

Proof. For the risk-based remainder Rf'n given in (2.3) we have 


var 


[i2fjc,/)] =4^ 


/: 


J,n 


i=i 


(1 T C^j,n) 


4 <4D«„(c,/). 


The bound on the variance of the likelihood-based remainder in (2.4) is very similar. For 
R 2 ^n ™ (2-3) we have 


var[fi^„(c)] = 2 ^ 
i=i 


( 2 cAj> + 

(1 -t- cXj^n)^ 


< 


o \ ^ (cAj,n)^ 
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For the likelihood-based remainder in (2.4) we have 



in view of the inequality log(l + x) — x/{l + x) > x^/(l -|- x)^/2 for x > 0. 


The third and fourth assertions of the lemma follow by applying Lemma 47. For the risk- 


based remainder given in (2.3), we use the lemma with the choices: 

• for Ri n- (ct) /d) = (0, 2 ), Oj = 2fj^n, Uj = Zj^n and {S, 7 ) = (0, 2 ), where the sum in (8.4) 
becomes 

• for R 2 ^n- i ‘^1 /^) = (*^) 2)) = 1) Uj — ^jn ~ 1 7 ) ~ (2) 2), where the sum in (8.4) 

becomes D^^- 

For the likelihood-based remainder, given in (2.4), we use the lemma with the choices: 

• for Rf n- (®)/^) = (0) 1)) o,j = Uj = and (5, 7 ) = (0,1), where the sum in (8.4) 

becomes 41)^^, 

• for 7 ? 2 n- (“j/S) = ( 1 ) 0 )) = “ 1 ) Uj = Zj,n^ — 1 and ( 5 , 7 ) = ( 2 , 2 ), where the sum in 

(8.4) will become which is bounded by a multiple of D^^- 

This concludes the proof. □ 

Lemma 38. For the functions in both (2.3) and (2./) and any s < t in In we have 





Proof. By Lemma 46 with (a, /3) = (0, 2) and as in (2.3) we have 



The function in (2.4) can be treated similarly, with the choice {a, ft) = (0,1). 


Applying Lemma 46 with {a, ft) = (2,0) to D^nic), we find 




n . n 



The right side is s^(s), by definition (2.6). Applying the mean value theorem to D^n ia (2.4) 
we find for some s < f <t, 




This concludes the proof. 


□ 
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Proof of Proposition 10. Applying Lemmas 37 and 38, we see that for any s < t in 


var 


DnisJ) Dn{t,f) 


< 2 var 


DnisJ) 


+ 2 var[i?i,„(t,/)] 


DnisJ)-DnitJ) 

DnisJ)DnitJ) 


< 

r\-/ 


(t-s)^ L»?„(s,/)+ 4 (s) 


< 


s’^DjsJ) s’^DjtJ) DlisJ) 
it - s)^ 


since DnisJ) > D 2 JS) > (sre)^/™ x sjs) by Lemma 14. Similarly, applying Lemma 37 we 
see that 


D„(aJ) ) ~ D„(sJ) ~ (snji/m 

by Lemma 14. The resnlt for Ri^n follows from the preceding two displays, by application of 
Lemma 48. The assertion for ii 2 ,n is proved analogously, from the other parts of Lemmas 37 
and 38. □ 

7 . 4 . Proof of Proposition 15 

In addition to Lemma 38 we need the following lemma. 

). For any c and any s < t in (0, 00 ) we have 

r[i?3,n(c,/)] <4D^^nicJ), 
ir[i? 4 ,„(c)] < 2D^Jc), 

it — s)^Df is) 

r [i?3,n(s, /) - R4,nit, /)] < ^ 


Lemma 39. 


var [ 
var 


var 


it — s)^D^ is) 
var[i?4,„(s) - Rijt)] < --. 


Proof. For the first two inequalities we compute 

■ [Rsjc, /)] = 4 f; /), 


var 


var 


[RaJc)] = 2 < ^Dlnic). 


The third and fourth inequalities follow by application of Lemma 47 with the following choices: 

• for R'iy. (a,/3) = (1,1), ctj = —2/j_„, Uj = Zj^n and ( 5 , 7 ) = (0,2), where the sum in 
(8.4) becomes 4Zlf^. 

• for R^y. (a, j5) = (2, 0), Oj = 1, Uj = — 1 and (5, 7 ) = (2, 2), where the sum in (8.4) 

becomes D^n- 

This concludes the proof. □ 
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Proof of Proposition 15. Using Lemmas 39 and 38, we have for s < f in 


(Rz,n{sJ) -R3,n(L/) 

mtj) 

<2varl --) +2var[ii3,„«./)] ( /) 


^ jt- s)^ ^ (t-sf + si{s) 


< 


s^DRisJ) s^DR{t,f) DRisjy 

it - s? 


since and 0^(1, f) > > (gj^)i/m ^ s‘^{s) by Lemma 14. Similarly, we have 

by Lemma 39 


var 


( RsA^J) 


< 


< 


\D^{s,f)J - DR{sJ) - MV-’ 

by Lemma 14. The proposition with Dn = follows by an application of Lemma 48. 

Since Dj^ > D^/2, this immediately implies the proposition for the likelihood-based 
norming. The assertion for i? 4 ,n is proved analogously, from the other parts of Lemmas 39 
and 38. □ 


7.5. Proof of Proposition 16 
Lemma 40. For s < t we have 


|4W -4(s)| 


< 


\t - s s„(s) 


Proof. This is immediate from the definition of in (2.6) and Lemma 46 with (a, /3) = 

( 1 , 0 ). □ 

Proof of Proposition 16. It is immediate from the dehnition of that 


E 


Nn{c) • 

L4(c) 


= 0, var [iV„(c)] < s2(c). 


Applying Lemma 47 with (a,/3) = (1,0), Oj = 1, (7,(5) = (1,1) and g = we find that for 
s < t 

{t - sfsl{s) 


var [Nn{s) - iV„(t)] 


< 


It follows that 

'Nnis) Nr^it) 


var I I < 2var ( + 2var[iV„(t)] 


4 (s) slit) 


sUs) 

{t - s)2 ^{t- s)2 


slis)slit) 


< 

^ c2 C.2 


s'^slit) 


< 


(t - s)^ 


^2+1/m^l/m ’ 


by Lemma 14. The proposition follows by an application of Lemma 48. 


□ 


imsart-generic ver. 2011/11/15 file: polished.tex date: April 30, 2015 
























Sniekers et al./Credible sets with Gaussian process prior 


39 


For Brownian motion, we can gain more insight in the behaviour of (part of) the function 
^2 ■ 

Lemma 41. For the Brownian motion prior and c G [logn/n,n], 

log det T,n^c ~ \/cm. 

Proof. We want to find the determinant of the n x n matrix 

\ /2 + 


S = c 



1 

1 

n+ 

1 

c 

1 

n+ 

2 


1 

n+ 

2 


n+ 

n+ 


n+ 


1 

2 





n+ 

n+ 


1 _|_ n—1 
c n+ 

n—1 

n—1 

\ 

1 

2 


n+ 

1 71 

c W 

n+ 

n+ 


n+ 


c 

n+ 


-1 


-1 2 + -^ -1 
n+ 


0 


-1 


2 + 


n+ 


0 \ 
0 

-1 


-1 1 + 

n+, 


If we denote this determinant by dn, we see that 


dn — 1 2 + 


n+) 


dn—l dn—2^ 


with = 1 + ^ and ^2 = r^) “ ^^^at this is the 

relation as (2.2) in [16]. The solution is given by dn = AX'f + BXf, where 

+ cn+(3 — A) + nl(l — A) 


same recurrence 


A = 


(A+-A_)A+n2+ 


A± = 1 + 4^± 


2re_i 


2^ 


4 H-. 

n+ 


Note that A+A_ = 1. Since ^ 0 uniformly in c G In, we have A± —)• 1 and 

A= + „(!) = 1, 

(A+ - A_) ^0(4 + 0) 2 

It is easy to see that B = X-A ~ A. Furthermore, we have 

2 


log(A+) = n 
Finally, we have 

The result follows. 


9 ,/^V4T0 9 

2 2 2 


+ 0(03/2) 


= nV0 + O(n03/2). 


logd, - iog(AA:^) = log (^1 + |a2_-) ^ 0. 


□ 


8. Technical results 

Lemma 42. Let Di : In —)■ (0, oo) be a decreasing function and D 2 : In (0, 00 ) an 
increasing function. Suppose that there exist a,b,B,B' > 0 such that 

Di{Kc) < ii:"“Oi(c), for any K > I, (8.1) 

B'k^D 2 {c) > D 2 {kc), > Bk^D 2 {c) for any k < 1. (8-2) 

Let c satisfy Di{c) = 02 (c), and for a given constant E > 1, define A = |c : {Di + 02)(c) < 
O (Oi + 02)(c)}. Then 


imsart-generic ver. 2011/11/15 file: polished.tex date: April 30, 2015 




















Sniekers et al./Credible sets with Gaussian process prior 


40 


(i) Di(c) < B for every c G A. 

(ii) A C [(2^)-V“c, {2EB'y/^c]. 

Proof, (i). If c > c, then Di{c) < D 2 {c), since Di and D 2 are equal at c and decreasing 
and increasing respectively. The inequality in (i) is then satisfied, since > 1. If 

c < c, then by (8.1) with K = cfc we have 

(c/c)“I)i(c) < Di{c). 


If c G A, then also 


Di{c) < {Di + D 2 ){c) < E{Di + D 2 ){c) = 2EDi{c) 

by the definition of c. Concatenating these inequalities, we conclude that [cjcf' < 2E^ or 
c > bic for bi = (2£l)“^/“ < 1. Then, by monotonicity and (8.2), 

02 ( 0 ) > D2ibic) > Bb\D2{c). 

This is equal to Bb\Di{c) > Bb\/{2E)Di{c) by the second last last display. This concludes 
the proof of (i). 

(ii). The lower bound on A in (ii) is equivalent to the inequality c > bic, which was already 
obtained in the preceding proof of (i). For the upper bound we hrst note that for every c G A 
we have ^> 2 ( 0 ) < Di{c) + ^> 2 ( 0 ) < E{Di + D 2 ){c) = 2ED2{c), by the definition of c. If c > c, 
then (8.2) gives that the right hand side is bounded above by 2EB'{c/c)^D2{c). Concatenation 
of the inequalities gives that 1 < 2EB' (c/c)^. □ 

The following lemma is applied throughout to handle the sums that occur in both the 
deterministic and stochastic terms of L. 


Lemma 43. Let 7 > —1, m > 1 and 1 / G M such that 7 — mi' < —1. Then 

n ..y 

^ (j- + cnr “ + „(1)) (8.3) 

uniformly for c G [Z„/n, as n ^ 00 , for any In —>• 00 . The constant is given by 

_ r u 

{u-^ + iy 

Furthermore, the left side of (8.3) has the same order as the right side uniformly in c C 
[ln/n,n'^~^] , for any In —>■ 00 , possibly with a smaller constant. 

Proof. If 7 < 0, then the function t 1 —?■ g{t) = E/(P^ + cn)^ is decreasing on [0,oo), while if 
7 > 0 the function is unimodal with a maximum at k{cn)^^'^ for the constant k = {pfj{mv — 
7 ))i/m_ £j.g|. have 


+ cnY ~ ^ (j™ + cnY ~ Jq {t^ + cnf 


E 


■ dt, 


while in the second case 
E 


pn J.7 *7 rn y.7 

/ 7-^ dt — g(fc(cn)^/”*) < dt + g{k{cnY^'^). 

{t^ + cnY ^ + cnY ~ Jo {f^ + cnY ^ > > 
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By the change of coordinates = {cn)u^ we have 


/ 


+ cnY 


pn/{cnY^ 

dt = / 

J a/(cn)^ 


Kcny/m (u™ +1)' 


■ du. 


If cn —>■ oo with (cn)^/™ <C n, then for both a = 0 and a = 1 the integral on the right 
approaches which is finite under the conditions of the lemma. The maximum value in 

the second display satisfies g{k{cn)^/^) < and hence is of lower order than the 

right side of the preceding display if cn —)• oo. This proves the hrst assertion of the lemma. 
For c as in the second assertion we still have that cn ^ oo, so that the lower limit of the 
integral tends to zero, but the upper limit n/(cn)^/™' may remain bounded, although it is 
bigger than 1 by assumption. □ 

Lemma 44. For 7 > — 1, m > 1 and n S M such that 7 — mn < —1 we have 

71 71 / . ,\/y 

^ ^ + cn^) 

i=i j=i > 


{cn^yilra-v-r\lm ^ 


(1 + log(cn^)) if cr? < 

(1 + log(n^™'/(cn^))) i/cn^ > 


uniformly for c € [In/n as n ^ 00 , for any /„ —>• 00 . 

Proof. Since cn^ < + cn^ < 2 cn^ if (ij)™ < cn^ and < (ij)"^ + cn^ < 2{ijy 

otherwise, the double sum is up to a constant 2'^ bounded above and below by 


n n / ■ ■\'y 

EE EEC.) 


'y—TTii' 


{ij)'^<C71^ 


i=l j=l 


Since cn^ > In ^ 00 , the first sum is never empty; the second is empty if cn^ = n^™' takes it 
maximally allowed value. To proceed we consider the cases that N := is smaller or 

bigger than n separately. If < n, then the second sum splits in two parts and the preceding 
display is equivalent to 


N Nji 


EE^lr+E E E Eci) 

i=l j=l i=l j=N/i+l i=N+l j=l 

2=1 2 = 1 2=A^+1 


N 


'y—mu 


7—mi/ 


X {logN)N^+^-^^ + (logiV)Ai^-™^+^ + ATT'-^^+i. 

If > n, then the first sum splits into two parts and we obtain the equivalent expression 

N/n n y,,y.y n N/i 


Z—/ / ^ AITUU Z—/ / ^ AjTnv Z—/ / ^ 


7—mz/ 


2=1 ^ = 1 i=N/7i-\-l j=l 2 =A^/r 2 +l j'=A^/2+1 

71 ./V / A r / ^ 


Pn.+i 


E 




z7iv/i)7+i _ 

/ ^ Afmu / ^ 


■■y-mo 


i=l 

jyj-'y—mu+l 


i=N/n+l 
2 


{N/i 

i=N/7i-yi 

+ (log(n7iV))iV^-™'^+i + (log(nViV))iV^+^' 
These bounds can be written in the form given by the lemma. 


17 —mz/+l 


□ 
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Lemma 45. For m > 1 and z/ E M such that —mv < —1, we have 


EE 


1 


[cn'^) 


2 \— 


uniformly for c G [In/Pas n ^ oo, for any In —>• oo. 

Proof. Since the function (s, t) l/((s^ ++ cn^)^ is decreasing in s and t, we have 

" pn pn 2 


§ S ((«2 + cn2)' 


and 


EE 


1 


< 


> 


pn pr 

Jo Jo 

/•n 1*72 

Jl Jl 


((s^ + 


1 


((s^ + + cn^y 


iPifPi {P + cn^y 

Rewriting the double integrals in polar coordinates, we see that 


ds dt 


ds dt. 


vr 


f 

JP': 




1 


< 


TT 


py/^n 

Jo 


J ^/2 {r"^^ + cn^Y ~p{P + P)"^ + cpy ^-^0 + cn^) 

By the change of coordinates r = (cn^) we then have 

fbn „ . . , rbnl{cn‘^/tC'm) 


■ dr. 


/ 


^j.2m ^^2^ 


■ dr = (cn^) 


2\ —o+llm 


2/(cn2)l/(2m) (u2"* + 1) 


■ dtz. 


Since cP oo the lower limit of this integral tends to zero. Combining this with the fact 


that the upper limit is bounded from below by b, the result follows. 


□ 


The following three lemmas are used to establish uniform bounds on the stochastic remain¬ 
der terms. 

Lemma 46. Consider a function g : (0, oo) —)• M o/ the form 

g{c) = -- 


(1 + cXj,nP+h'' 

where ce, /3 > 0 are integers. Then, for 0 < s < t < oo, 


\gis)-g{t)\ < 


s — 


sX 


‘J,n 


S (1 + sA,>)2V(l+/3) • 

In particular, if jd > 2, then \g{s) - g{t)\ < p- 

Proof. We apply the mean value theorem to the function h{x) = x"/(l + Note that 

for X > 0 we have 


|h'(x)| = 


x" ^[—/dx + a) 


a—l 


< 

r\j 


X X 

(1 + xY+°^+h "*"(! + x)^+"+/^ 


< 


(1 + x)^+"+^ 

1^1^ 1 
{1 + xY+h + (1 + x)2+/5 ~ (1 + x)2V(l+/3) 
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Hence 


l£/(s) ^ |s 


t\ 


'3,n 


(1 + sA,>)2V(l+/3) 


s ^1 sXj^Yi 

S (1 + sA,>)2V(l+/3) ■ 


□ 


Lemma 47. Consider the stochastic process {U{c) : c > 0) given by, for constants aj, 
i.i.d. mean-zero random variables with finite variance Uj and integers a,/? > 0, 


wc) = E 

i=i 


aj{cXj^n)°^ jj 
(l + cAj>)“+/3 r 


Suppose that for some 7 ,5 G {0, 1 , 2 } and some non-negative function g we have 




(8.4) 


Then, for 0 < s < t < 00 , 


(s - tfig{s) 


var(C/(s) - t/(t)) < 2 


Proof. We consider 


n 


var[[/(s) — U{t)] = g; 

i= 


{sXj^nT {t^j,nY 


^ ^ L(l + «V)“+/3 + 


Applying the previous lemma, we see that 


We conclude 


jsXj^n)'^ 

(1 + 


(1 + tAj-„)“+/3 


< 

rs_/ 


s t\ sXj^ji 
s (1 “ 1 “ 


var[t/( 5 ) — U{t)] 


< 




n 


E 


a‘j{sXj^nY 

(1 + sXj^nY 


< 


{s — t)2 ^ 2 {sXj^nY 

,2 + 


which holds for any 7 , 5 G {0,1, 2}. The result follows. □ 

Lemma 48. Let In ^ 00 be a given sequence of numbers. If Un = {Un{s) : s G In) are 
continuous stochastic processes such that for all s < t in a closed interval In C [Zn/n,oo) and 
some a> t) we have 


E[t/„(») - c /„((7 < 1 ^. 

then sup^g/^ lC/„(s)| tends to zero in probability. 

Proof. Write In = [o„, 6 „]. For a given interval [sqj^o] <G In we have E[Un{s) — t/n(t)]^ < 
dQ{s,t), for do the metric 


do{s,t) = Ko\t - s\, Ko = n 
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The do-diameter of [so,to] is — sol and the covering number N{u, [soTo])do) is bounded 
above by (iiiol^o “ sol/n) V 1. Therefore by Corollary 2.2.5 in [21], with we have 


E sup [Un{s) - Un{t)]‘^ < Ko\to - Sol"^ 

S,t&[so,to] 


jto/SQ - Ip 
(nso)“ 


Fix M so that < l/a^ < 2^ and N so that 2-^“^ <hn < ■ Define S-m = ^n, sn = 

and Si = 2* for i G {—M + 1,..., — 1}. Then s-m < S-m+i < ■ ■ ■ < sn partitions In- Since 

Sj+i/sj — 1 < 1 for every i (in fact, equal to 1 except for the extremest values), we then have 


E sup Un(s)‘^ < 2E max 

seir, 


sup \Unis) - Un{Si)\‘^ + Uni-Si)'^ 

_SE[Si,5i+l] 


< 


E 


V 1 


i=-M 
Af-1 


(nsi)“ (nsj)“ 


^ \ ^ ^ i\Ma ^ 

~ n“ ^ ~ re“ 1 


_ ‘2—a{M^N) 


< 


i=-M 

1 / 2 1 


- 2-“ 


< 


1 2'^ 


n^\an^ 1 - 2 -“ “ 1 - 2 -“’ 

by definition of M. This tends to zero, since In —)• oo. 


□ 
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